Preface


The  past  25  years  have  seen  great  advances  in  both  Bayesian  and  frequentist 
methods  for  data  analysis.  The  most  significant  advance  for  the  Bayesian  approach 
has  been  the  development  of  Markov  chain  Monte  Carlo  methods  for  estimating 
expectations  with  respect  to  the  posterior,  hence  allowing  flexible  inference  and 
routine  implementation  for  a  wide  range  of  models.  In  particular,  this  development 
has  led  to  the  more  widespread  use  of  hierarchical  models  for  dependent  data.  With 
respect  to  frequentist  methods,  estimating  functions  have  emerged  as  a  unifying 
approach  for  determining  the  properties  of  estimators.  Generalized  estimating 
equations  provide  a  particularly  important  example  of  this  methodology  that  allows 
inference  for  dependent  data. 

The  aim  of  this  book  is  to  provide  a  modern  description  of  Bayesian  and 
frequentist  methods  of  regression  analysis  and  to  illustrate  the  use  of  these  methods 
on  real  data.  Many  books  describe  one  or  the  other  of  the  Bayesian  or  frequentist 
approaches  to  regression  modeling  in  different  contexts,  and  many  mathematical 
statistics  texts  describe  the  theory  behind  Bayesian  and  frequentist  approaches 
without  providing  a  detailed  description  of  specific  methods.  References  to  such 
texts  are  given  at  the  end  of  Chaps.  2  and  3.  Bayesian  and  frequentist  methods  are 
not viewed here as competing, but rather as complementary techniques, and in this
respect this book is somewhat unusual.

In  embarking  on  the  writing  of  this  book,  I  have  been  influenced  by  many  current 
and  former  colleagues.  My  early  training  was  in  the  Mathematics  Department  at 
the  University  of  Nottingham  and  my  first  permanent  academic  teaching  position 
was  in  the  Mathematics  Department  at  Imperial  College  of  Science,  Technology 
and  Medicine  in  London.  During  this  period  I  was  introduced  to  the  Bayesian 
paradigm  and  was  greatly  influenced  by  Adrian  Smith,  both  as  a  lecturer  and  as 
a  Ph.D.  adviser.  I  have  also  benefited,  and  continue  to  benefit,  from  numerous 
conversations with Dave Stephens, whom I have known for over 25 years. Following
my move to the University of Washington in Seattle, I was exposed to a very modern
view  of  frequentist  methods  in  the  Department  of  Biostatistics.  In  particular,  Scott 
Emerson,  Patrick  Heagerty  and  Thomas  Lumley  have  provided  constant  stimulation. 
These  interactions,  among  many  others,  have  influenced  the  way  I  now  think  about 
statistics, and it is this exposure which I hope has allowed me to write a balanced
account  of  Bayesian  and  frequentist  methods.  There  is  some  theory  in  this  book  and 
some  data  analysis,  but  the  focus  is  on  material  that  lies  between  these  endeavors 
and concerns methods. At the University of Washington there is an advanced three-course
regression methods sequence, and this book arose out of my teaching of the
three  courses  in  the  sequence. 

If modern computers had been available 100 years ago, the discipline of
statistics would have developed in a dramatically different fashion from the way in
which it actually evolved. In particular, there would probably be less dependence on
linear  and  generalized  linear  models,  which  are  mathematically  and  computationally 
convenient.  While  these  model  classes  are  still  useful  and  do  possess  a  number 
of  convenient  mathematical  and  computational  properties,  I  believe  they  should  be 
viewed  as  just  two  choices  within  a  far  wider  range  of  models  that  are  now  available. 
The approach to modeling that is encouraged in this book is first to specify the
model suggested by the background science and then to examine the
mathematical and computational aspects of the model.

As a preparation for this book, the reader is assumed to have a grasp of calculus
and linear algebra and to have taken first courses in probability and statistical theory.
The  content  of  this  book  is  as  follows.  An  introductory  chapter  describes  a  number 
of  motivating  examples  and  discusses  general  issues  that  need  consideration  before 
a regression analysis is carried out. This book is then broken into five parts: I, Inferential
Approaches; II, Independent Data; III, Dependent Data; IV, Nonparametric
Modeling; V, Appendices. The first two chapters of Part I provide descriptions of the
frequentist  and  Bayesian  approaches  to  inference,  with  a  particular  emphasis  on  the 
rationale  of  each  approach  and  a  delineation  of  situations  in  which  one  or  the  other 
approach  is  preferable.  The  third  chapter  in  Part  I  discusses  model  selection  and 
hypothesis  testing.  Part  II  considers  independent  data  and  contains  three  chapters  on 
the  linear  model,  general  regression  models  (including  generalized  linear  models), 
and  binary  data  models.  The  two  chapters  of  Part  III  consider  dependent  data 
with  linear  models  and  general  regression  models.  Mixed  models  and  generalized 
estimating  equations  are  the  approaches  to  inference  that  are  emphasized.  Part  IV 
contains  three  chapters  on  nonparametric  modeling  with  an  emphasis  on  spline  and 
kernel  methods.  The  examples  and  simulation  studies  of  this  book  were  almost 
exclusively  carried  out  within  the  freely  available  R  programming  environment.  The 
code  for  the  examples  and  figures  may  be  found  at: 

http://faculty.washington.edu/jonno/regression-methods.html 

along  with  the  inevitable  errata  and  links  to  datasets.  Exercises  are  included  at 
the  end  of  all  chapters  but  the  first.  Many  of  these  exercises  concern  analyses  of 
real  data.  In  my  own  experience,  a  full  understanding  of  methods  requires  their 
implementation  and  application  to  data. 

In  my  own  teaching  I  have  based  three  one-quarter  courses  on  the  following. 
Regression  Methods  for  Independent  Data  is  based  on  Part  II,  dipping  into  topics  in 
Part I as needed and using motivating examples from Chap. 1. Regression Methods
for Dependent Data centers on Part III, again using examples from Chap. 1, and
building  on  the  independent  data  material.  Finally,  Nonparametric  Regression  and 
Classification is based on the material in Part IV. The latter course is stand-alone in
the sense of not requiring the independent and dependent data courses, though extra
material  on  a  number  of  topics,  including  linear  and  generalized  linear  models  and 
mixed  models,  will  need  to  be  included  if  not  previously  encountered. 

In  the  2003-2004  academic  year  I  was  the  Genentech  Professor  and  received 
funding  specifically  to  work  on  this  book.  The  staff  at  Springer  have  been  very 
helpful  at  all  stages.  John  Kimmel  was  the  editor  during  most  of  the  writing  of  this 
book  and  I  am  appreciative  of  his  gentle  prodding  and  advice.  About  18  months 
from  the  completion  of  this  book,  Marc  Strauss  stepped  in  and  has  also  been  very 
supportive.  Many  of  my  colleagues  have  given  comments  on  various  chapters,  but 
I  would  like  to  specifically  thank  Lurdes  Inoue,  Katie  Kerr,  Erica  Moodie,  Zoe 
Moodie,  Ken  Rice,  Dave  Stephens,  Jon  Wellner,  Daniela  Witten,  and  Simon  Wood 
for  feedback  on  different  parts  of  this  book.  Finally,  lest  we  forget,  I  would  like 
to  thank  all  of  those  students  who  suffered  through  initial  presentations  of  this 
material; I hope your sacrifices were not in vain...

Seattle,  WA  Jon  Wakefield 

June  2012 


Contents 


1 Introduction and Motivating Examples
1.1 Introduction
1.2 Model Formulation
1.3 Motivating Examples
1.3.1 Prostate Cancer
1.3.2 Outcome After Head Injury
1.3.3 Lung Cancer and Radon
1.3.4 Pharmacokinetic Data
1.3.5 Dental Growth
1.3.6 Spinal Bone Mineral Density
1.4 Nature of Randomness
1.5 Bayesian and Frequentist Inference
1.6 The Executive Summary
1.7 Bibliographic Notes

Part  I  Inferential  Approaches 

2 Frequentist Inference
2.1 Introduction
2.2 Frequentist Criteria
2.3 Estimating Functions
2.4 Likelihood
2.4.1 Maximum Likelihood Estimation
2.4.2 Variants on Likelihood
2.4.3 Model Misspecification
2.5 Quasi-likelihood
2.5.1 Maximum Quasi-likelihood Estimation
2.5.2 A More Complex Mean-Variance Model
2.6 Sandwich Estimation
2.7 Bootstrap Methods
2.7.1 The Bootstrap for a Univariate Parameter
2.7.2 The Bootstrap for Regression
2.7.3 Sandwich Estimation and the Bootstrap
2.8 Choice of Estimating Function
2.9 Hypothesis Testing
2.9.1 Motivation
2.9.2 Preliminaries
2.9.3 Score Tests
2.9.4 Wald Tests
2.9.5 Likelihood Ratio Tests
2.9.6 Quasi-likelihood
2.9.7 Comparison of Test Statistics
2.10 Concluding Remarks
2.11 Bibliographic Notes
2.12 Exercises

3 Bayesian Inference
3.1 Introduction
3.2 The Posterior Distribution and Its Summarization
3.3 Asymptotic Properties of Bayesian Estimators
3.4 Prior Choice
3.4.1 Baseline Priors
3.4.2 Substantive Priors
3.4.3 Priors on Meaningful Scales
3.4.4 Frequentist Considerations
3.5 Model Misspecification
3.6 Bayesian Model Averaging
3.7 Implementation
3.7.1 Conjugacy
3.7.2 Laplace Approximation
3.7.3 Quadrature
3.7.4 Integrated Nested Laplace Approximations
3.7.5 Importance Sampling Monte Carlo
3.7.6 Direct Sampling Using Conjugacy
3.7.7 Direct Sampling Using the Rejection Algorithm
3.8 Markov Chain Monte Carlo
3.8.1 Markov Chains for Exploring Posterior Distributions
3.8.2 The Metropolis-Hastings Algorithm
3.8.3 The Metropolis Algorithm
3.8.4 The Gibbs Sampler
3.8.5 Combining Markov Kernels: Hybrid Schemes
3.8.6 Implementation Details
3.8.7 Implementation Summary
3.9 Exchangeability
3.10 Hypothesis Testing with Bayes Factors
3.11 Bayesian Inference Based on a Sampling Distribution
3.12 Concluding Remarks
3.13 Bibliographic Notes
3.14 Exercises

4 Hypothesis Testing and Variable Selection
4.1 Introduction
4.2 Frequentist Hypothesis Testing
4.2.1 Fisherian Approach
4.2.2 Neyman-Pearson Approach
4.2.3 Critique of the Fisherian Approach
4.2.4 Critique of the Neyman-Pearson Approach
4.3 Bayesian Hypothesis Testing with Bayes Factors
4.3.1 Overview of Approaches
4.3.2 Critique of the Bayes Factor Approach
4.3.3 A Bayesian View of Frequentist Hypothesis Testing
4.4 The Jeffreys-Lindley Paradox
4.5 Testing Multiple Hypotheses: General Considerations
4.6 Testing Multiple Hypotheses: Fixed Number of Tests
4.6.1 Frequentist Analysis
4.6.2 Bayesian Analysis
4.7 Testing Multiple Hypotheses: Variable Selection
4.8 Approaches to Variable Selection and Modeling
4.8.1 Stepwise Methods
4.8.2 All Possible Subsets
4.8.3 Bayesian Model Averaging
4.8.4 Shrinkage Methods
4.9 Model Building Uncertainty
4.10 A Pragmatic Compromise to Variable Selection
4.11 Concluding Comments
4.12 Bibliographic Notes
4.13 Exercises

Part  II  Independent  Data 

5 Linear Models
5.1 Introduction
5.2 Motivating Example: Prostate Cancer
5.3 Model Specification
5.4 A Justification for Linear Modeling
5.5 Parameter Interpretation
5.5.1 Causation Versus Association
5.5.2 Multiple Parameters
5.5.3 Data Transformations
5.6 Frequentist Inference
5.6.1 Likelihood
5.6.2 Least Squares Estimation
5.6.3 The Gauss-Markov Theorem
5.6.4 Sandwich Estimation
5.7 Bayesian Inference
5.8 Analysis of Variance
5.8.1 One-Way ANOVA
5.8.2 Crossed Designs
5.8.3 Nested Designs
5.8.4 Random and Mixed Effects Models
5.9 Bias-Variance Trade-Off
5.10 Robustness to Assumptions
5.10.1 Distribution of Errors
5.10.2 Nonconstant Variance
5.10.3 Correlated Errors
5.11 Assessment of Assumptions
5.11.1 Review of Assumptions
5.11.2 Residuals and Influence
5.11.3 Using the Residuals
5.12 Example: Prostate Cancer
5.13 Concluding Remarks
5.14 Bibliographic Notes
5.15 Exercises

6 General Regression Models
6.1 Introduction
6.2 Motivating Example: Pharmacokinetics of Theophylline
6.3 Generalized Linear Models
6.4 Parameter Interpretation
6.5 Likelihood Inference for GLMs
6.5.1 Estimation
6.5.2 Computation
6.5.3 Hypothesis Testing
6.6 Quasi-likelihood Inference for GLMs
6.7 Sandwich Estimation for GLMs
6.8 Bayesian Inference for GLMs
6.8.1 Prior Specification
6.8.2 Computation
6.8.3 Hypothesis Testing
6.8.4 Overdispersed GLMs
6.9 Assessment of Assumptions for GLMs
6.10 Nonlinear Regression Models
6.11 Identifiability
6.12 Likelihood Inference for Nonlinear Models
6.12.1 Estimation
6.12.2 Hypothesis Testing
6.13 Least Squares Inference
6.14 Sandwich Estimation for Nonlinear Models
6.15 The Geometry of Least Squares
6.16 Bayesian Inference for Nonlinear Models
6.16.1 Prior Specification
6.16.2 Computation
6.16.3 Hypothesis Testing
6.17 Assessment of Assumptions for Nonlinear Models
6.18 Concluding Remarks
6.19 Bibliographic Notes
6.20 Exercises

7 Binary Data Models
7.1 Introduction
7.2 Motivating Examples
7.2.1 Outcome After Head Injury
7.2.2 Aircraft Fasteners
7.2.3 Bronchopulmonary Dysplasia
7.3 The Binomial Distribution
7.3.1 Genesis
7.3.2 Rare Events
7.4 Generalized Linear Models for Binary Data
7.4.1 Formulation
7.4.2 Link Functions
7.5 Overdispersion
7.6 Logistic Regression Models
7.6.1 Parameter Interpretation
7.6.2 Likelihood Inference for Logistic Regression Models
7.6.3 Quasi-likelihood Inference for Logistic Regression Models
7.6.4 Bayesian Inference for Logistic Regression Models
7.7 Conditional Likelihood Inference
7.8 Assessment of Assumptions
7.9 Bias, Variance, and Collapsibility
7.10 Case-Control Studies
7.10.1 The Epidemiological Context
7.10.2 Estimation for a Case-Control Study
7.10.3 Estimation for a Matched Case-Control Study
7.11 Concluding Remarks
7.12 Bibliographic Notes
7.13 Exercises

Part  III  Dependent  Data 

8 Linear Models
8.1 Introduction
8.2 Motivating Example: Dental Growth Curves
8.3 The Efficiency of Longitudinal Designs
8.4 Linear Mixed Models
8.4.1 The General Framework
8.4.2 Covariance Models for Clustered Data
8.4.3 Parameter Interpretation for Linear Mixed Models
8.5 Likelihood Inference for Linear Mixed Models
8.5.1 Inference for Fixed Effects
8.5.2 Inference for Variance Components via Maximum Likelihood
8.5.3 Inference for Variance Components via Restricted Maximum Likelihood
8.5.4 Inference for Random Effects
8.6 Bayesian Inference for Linear Mixed Models
8.6.1 A Three-Stage Hierarchical Model
8.6.2 Hyperpriors
8.6.3 Implementation
8.6.4 Extensions
8.7 Generalized Estimating Equations
8.7.1 Motivation
8.7.2 The GEE Algorithm
8.7.3 Estimation of Variance Parameters
8.8 Assessment of Assumptions
8.8.1 Review of Assumptions
8.8.2 Approaches to Assessment
8.9 Cohort and Longitudinal Effects
8.10 Concluding Remarks
8.11 Bibliographic Notes
8.12 Exercises

9 General Regression Models
9.1 Introduction
9.2 Motivating Examples
9.2.1 Contraception Data
9.2.2 Seizure Data
9.2.3 Pharmacokinetics of Theophylline
9.3 Generalized Linear Mixed Models
9.4 Likelihood Inference for Generalized Linear Mixed Models
9.5 Conditional Likelihood Inference for Generalized Linear Mixed Models
9.6 Bayesian Inference for Generalized Linear Mixed Models
9.6.1 Model Formulation
9.6.2 Hyperpriors
9.7 Generalized Linear Mixed Models with Spatial Dependence
9.7.1 A Markov Random Field Prior
9.7.2 Hyperpriors
9.8 Conjugate Random Effects Models
9.9 Generalized Estimating Equations for Generalized Linear Models
9.10 GEE2: Connected Estimating Equations
9.11 Interpretation of Marginal and Conditional Regression Coefficients
9.12 Introduction to Modeling Dependent Binary Data
9.13 Mixed Models for Binary Data
9.13.1 Generalized Linear Mixed Models for Binary Data
9.13.2 Likelihood Inference for the Binary Mixed Model
9.13.3 Bayesian Inference for the Binary Mixed Model
9.13.4 Conditional Likelihood Inference for Binary Mixed Models
9.14 Marginal Models for Dependent Binary Data
9.14.1 Generalized Estimating Equations
9.14.2 Loglinear Models
9.14.3 Further Multivariate Binary Models
9.15 Nonlinear Mixed Models
9.16 Parameterization of the Nonlinear Model
9.17 Likelihood Inference for the Nonlinear Mixed Model
9.18 Bayesian Inference for the Nonlinear Mixed Model
9.18.1 Hyperpriors
9.18.2 Inference for Functions of Interest
9.19 Generalized Estimating Equations
9.20 Assessment of Assumptions for General Regression Models
9.21 Concluding Remarks
9.22 Bibliographic Notes
9.23 Exercises

Part  IV  Nonparametric  Modeling 

10 Preliminaries for Nonparametric Regression
10.1 Introduction
10.2 Motivating Examples
10.2.1 Light Detection and Ranging
10.2.2 Ethanol Data
10.3 The Optimal Prediction
10.3.1 Continuous Responses
10.3.2 Discrete Responses with K Categories
10.3.3 General Responses
10.3.4 In Practice
10.4 Measures of Predictive Accuracy
10.4.1 Continuous Responses
10.4.2 Discrete Responses with K Categories
10.4.3 General Responses
10.5 A First Look at Shrinkage Methods
10.5.1 Ridge Regression
10.5.2 The Lasso
10.6 Smoothing Parameter Selection
10.6.1 Mallows C_p
10.6.2 K-Fold Cross-Validation
10.6.3 Generalized Cross-Validation
10.6.4 AIC for General Models
10.6.5 Cross-Validation for Generalized Linear Models
10.7 Concluding Comments
10.8 Bibliographic Notes
10.9 Exercises

11 Spline and Kernel Methods
11.1 Introduction
11.2 Spline Methods
11.2.1 Piecewise Polynomials and Splines
11.2.2 Natural Cubic Splines
11.2.3 Cubic Smoothing Splines
11.2.4 B-Splines
11.2.5 Penalized Regression Splines
11.2.6 A Brief Spline Summary
11.2.7 Inference for Linear Smoothers
11.2.8 Linear Mixed Model Spline Representation: Likelihood Inference
11.2.9 Linear Mixed Model Spline Representation: Bayesian Inference
11.3 Kernel Methods
11.3.1 Kernels
11.3.2 Kernel Density Estimation
11.3.3 The Nadaraya-Watson Kernel Estimator
11.3.4 Local Polynomial Regression
11.4 Variance Estimation
11.5 Spline and Kernel Methods for Generalized Linear Models
11.5.1 Generalized Linear Models with Penalized Regression Splines
11.5.2 A Generalized Linear Mixed Model Spline Representation
11.5.3 Generalized Linear Models with Local Polynomials
11.6 Concluding Comments
11.7 Bibliographic Notes
11.8 Exercises


12 Nonparametric Regression with Multiple Predictors
12.1 Introduction
12.2 Generalized Additive Models
12.2.1 Model Formulation
12.2.2 Computation via Backfitting
12.3 Spline Methods with Multiple Predictors
12.3.1 Natural Thin Plate Splines
12.3.2 Thin Plate Regression Splines
12.3.3 Tensor Product Splines
12.4 Kernel Methods with Multiple Predictors
12.5 Smoothing Parameter Estimation
12.5.1 Conventional Approaches
12.5.2 Mixed Model Formulation
12.6 Varying-Coefficient Models
12.7 Regression Trees
12.7.1 Hierarchical Partitioning
12.7.2 Multiple Adaptive Regression Splines
12.8 Classification
12.8.1 Logistic Models with K Classes
12.8.2 Linear and Quadratic Discriminant Analysis
12.8.3 Kernel Density Estimation and Classification
12.8.4 Classification Trees
12.8.5 Bagging
12.8.6 Random Forests
12.9 Concluding Comments
12.10 Bibliographic Notes
12.11 Exercises

Part  V  Appendices 

A Differentiation of Matrix Expressions
B Matrix Results
C Some Linear Algebra
D Probability Distributions and Generating Functions
E Functions of Normal Random Variables
F Some Results from Classical Statistics
G Basic Large Sample Theory
References
Index


Chapter  1 

Introduction  and  Motivating  Examples 


1.1  Introduction 

This book examines how a response is related to covariates using mathematical models
whose unknown parameters we wish to estimate using available information;
this endeavor is known as regression analysis. In this first chapter, we will begin in
Sect.  1.2  by  making  some  general  comments  about  model  formulation.  In  Sect.  1.3, 
a  number  of  examples  will  be  described  in  order  to  motivate  the  material  to 
follow  in  the  remainder  of  this  book.  In  Sect.  1.4,  we  examine,  in  simple  idealized 
scenarios,  how  “randomness”  is  induced  by  not  controlling  for  covariates  in  a 
model. Section 1.5 briefly contrasts the Bayesian and frequentist approaches to
inference. Section 1.6 summarizes the overall message of this book, which is that in
many instances, carefully thought out Bayesian and frequentist analyses will provide
similar conclusions; however, situations in which one or the other approach may be
preferred are also described. Finally, Sect. 1.7 gives references that expand on the
material of this chapter.


1.2  Model  Formulation 

In  a  regression  analysis,  the  following  steps  may  be  followed: 

1.  Formulate  a  model  based  on  the  nature  of  the  data,  the  subject  matter  context, 
and  the  aims  of  the  data  analysis. 

2.  Examine  the  mathematical  properties  of  the  initial  model  with  respect  to 
candidate  inference  procedures.  This  examination  will  focus  on  whether  specific 
methods  are  suited  to  both  the  particular  context  under  consideration  and  the 
specific  questions  of  interest  in  the  analysis. 

3.  Consider  the  computational  aspects  of  the  model. 

The examination in steps 2 and 3 may suggest that we need to change the model.¹
Historically,  the  range  of  model  forms  that  were  available  for  regression  modeling 
was  severely  limited  by  computational  and,  to  a  lesser  extent,  mathematical 
considerations.  For  example,  though  generalized  linear  models  contain  a  flexible 
range  of  alternatives  to  the  linear  model,  a  primary  motivation  for  their  formulation 
was ease of fitting and mathematical tractability. Hence, step 3 in particular took
precedence over step 1.

Specific  aspects  of  the  initial  model  formulation  will  now  be  discussed  in 
more  detail.  When  carrying  out  a  regression  analysis,  careful  consideration  of  the 
following  issues  is  vital  and  in  many  instances  will  outweigh  in  importance  the 
particular  model  chosen  or  estimation  method  used.  The  interpretation  of  parameters 
also  depends  vitally  on  the  following  issues. 


Observational  Versus  Experimental  Data 

An  important  first  step  in  data  analysis  is  to  determine  whether  the  data  are 
experimental  or  observational  in  nature.  In  an  experimental  study,  the  experimenter 
has  control  over  at  least  some  aspects  of  the  study.  For  example,  units  (e.g.,  patients) 
may  be  randomly  assigned  to  covariate  groups  of  interest  (e.g.,  treatment  groups). 
If  this  randomization  is  successfully  implemented,  any  differences  in  response  will 
(in  expectation)  be  due  to  group  assignment  only,  allowing  a  causal  interpretation 
of  the  estimated  parameters.  The  beauty  of  randomization  is  that  the  groups  are 
balanced  with  respect  to  all  covariates,  crucially  including  those  that  are  unobserved. 

In  an  observational  study,  we  never  know  whether  observed  differences  between 
the  responses  of  groups  of  interest  are  due,  at  least  partially,  to  other  “confounding” 
variables  related  to  group  membership.  If  the  confounders  are  measured,  then  there 
is  some  hope  for  controlling  for  the  variability  in  response  that  is  not  due  to  group 
membership,  but  if  the  confounders  are  unobserved  variables,  then  such  control  is 
not  possible.  In  the  epidemiology  and  biostatistics  literature,  this  type  of  discrepancy 
between  the  estimate  and  the  “true”  quantity  of  interest  is  often  described  as  bias 
due  to  confounding.  In  later  chapters,  this  issue  will  be  examined  in  detail,  since  it 
is  a  primary  motivation  for  regression  modeling.  In  observational  studies,  estimated 
coefficients  are  traditionally  described  as  associations,  and  causality  is  only  alluded 
to  more  informally  via  consideration  of  the  combined  evidence  of  different  studies 
and scientific plausibility. We expand upon this discussion in Sect. 1.4.
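
The contrast between randomized and observational data can be illustrated with a small simulation. The following Python sketch is not taken from the code that accompanies this book; the data-generating model, the function name simulate, and all numerical values are assumptions made purely for illustration. An unobserved variable u affects the response and, in the observational setting only, also affects which units are treated.

```python
import random


def simulate(randomized, n=20000, seed=1):
    """Return the difference in mean response, treated minus untreated.

    Hypothetical data-generating model (an assumption for illustration):
      u ~ N(0, 1) is an unobserved confounder,
      y = 1.0 * treated + 2.0 * u + noise, so the true treatment effect is 1.0.
    """
    rng = random.Random(seed)
    treated_y, control_y = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        if randomized:
            # Randomized study: assignment ignores u.
            treated = rng.random() < 0.5
        else:
            # Observational study: units with large u tend to be treated.
            treated = u + rng.gauss(0.0, 1.0) > 0
        y = 1.0 * treated + 2.0 * u + rng.gauss(0.0, 1.0)
        (treated_y if treated else control_y).append(y)
    return sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)


diff_randomized = simulate(randomized=True)
diff_observational = simulate(randomized=False)
print(f"randomized:    {diff_randomized:.2f}")
print(f"observational: {diff_observational:.2f}")
```

Under these assumptions the randomized comparison should give a difference close to the true effect of 1.0, while the observational comparison is inflated because treated units tend to have larger values of u.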

Predictive  models  are  more  straightforward  to  build  than  causal  models.  To 
quote  Freedman  (1997),  “For  description  and  prediction,  the  numerical  values  of  the 
individual  coefficients  fade  into  the  background;  it  is  the  whole  linear  combination 
on  the  right-hand  side  of  the  equation  that  matters.  For  causal  inference,  it  is  the 
individual  coefficients  that  do  the  trick.” 


¹ To make clear, we are not suggesting refining the model based on inadequacies of fit; this is a
dangerous enterprise, as we discuss in Chap. 4.

Study Population

Another  important  step  is  to  determine  the  population  from  which  the  data  were 
collected  so  that  the  individuals  to  whom  inferential  conclusions  apply  may  be 
determined.  Extrapolation  of  inference  beyond  the  population  providing  the  data 
is  a  risky  enterprise. 

Throughout  this  book,  we  will  take  a  superpopulation  view  in  which  probability 
models  are  assumed  to  describe  variability  with  respect  to  a  hypothetical,  infinite 
population.  The  study  population  that  exists  in  practice  consists  of  N  units,  of  which 
n  are  sampled.  To  summarize: 

Superpopulation (∞) → Study Population (N) → Sample (n)

Inference  for  the  parameters  of  a  superpopulation  may  be  contrasted  with  a  survey 
sampling  perspective  in  which  the  focus  is  upon  characteristics  of  the  responses  of 
the  N  units;  in  the  latter  case,  a  full  census  (n  =  N)  will  obviate  the  need  for 
statistical  analysis. 
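
To make the distinction between the three levels concrete, the short Python sketch below draws a hypothetical study population from a normal superpopulation and then compares a sample with a census. The distribution, the values of MU and N, and the sample size are illustrative assumptions, not quantities from any example in this book.

```python
import random
import statistics

rng = random.Random(0)
MU = 10.0   # superpopulation mean (hypothetical value)
N = 1000    # size of the study population (hypothetical value)

# Study population: N units drawn from the superpopulation N(MU, 2^2).
population = [rng.gauss(MU, 2.0) for _ in range(N)]
finite_mean = statistics.fmean(population)

# A sample of n < N units estimates the finite-population mean with error.
sample = rng.sample(population, 50)
sample_mean = statistics.fmean(sample)

# A census (n = N) recovers the finite-population mean,
# so no statistical analysis is needed for that target.
census_mean = statistics.fmean(rng.sample(population, N))

# The census mean still differs from the superpopulation mean MU.
print(f"finite-population mean: {finite_mean:.3f}")
print(f"sample mean (n = 50):   {sample_mean:.3f}")
print(f"census mean (n = N):    {census_mean:.3f}")
print(f"superpopulation mean:   {MU:.3f}")
```

From the survey sampling perspective the census answers the question completely, whereas from the superpopulation view the N units are themselves a draw, so uncertainty about MU remains.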


The  Sampling  Scheme 

The  data  collection  procedure  has  implications  for  the  analysis,  in  terms  of  the 
models  that  are  appropriate,  the  questions  that  may  be  asked,  and  the  inferential 
approach  that  may  be  adopted.  In  the  most  straightforward  case,  the  data  arise 
through  random  sampling  from  a  well-defined  population.  In  other  situations,  the 
random  samples  may  be  drawn  from  within  covariate-defined  groups,  which  may 
improve  efficiency  of  estimation  by  concentrating  the  sampling  in  informative 
groups  but  may  limit  the  range  of  questions  that  can  be  answered  by  the  data 
due  to  the  restrictions  on  the  sampling  scheme.  In  more  complex  situations,  the 
data  may  result  from  outcome-dependent  sampling.  For  example,  a  case-control 
study  is  an  outcome-dependent  sampling  scheme  in  which  the  binary  response  of 
interest  is  fixed  by  design,  and  the  random  variables  are  the  covariates  sampled 
within  each  of  the  outcome  categories  (cases  and  controls).  For  such  data,  care  is 
required  because  the  majority  of  conventional  approaches  will  not  produce  valid 
inference,  and  analysis  is  carried  out  most  easily  using  logistic  regression  models. 
Similar  issues  are  encountered  in  the  analysis  of  matched  case-control  studies,  in 
which  cases  and  controls  are  matched  upon  additional  (confounder)  variables.  Bias 
in  parameters  of  interest  will  occur  if  such  data  are  analyzed  using  methods  for 
unmatched  studies,  again  because  the  sampling  scheme  has  not  been  acknowledged. 
In  the  case  of  individually  matched  cases  and  controls  (in  which,  for  example,  for 
each  case  a  control  is  picked  with  the  same  gender,  age,  and  race),  conventional 
likelihood-based  methods  are  flawed  because  the  number  of  parameters  (including 
one  parameter  for  each  case-control  pair)  increases  with  the  sample  size  (providing 
an example of the importance of paying attention to the regularity conditions
required for valid inference); conditional likelihood provides a valid inferential
approach  in  this  case.  The  analysis  of  data  from  case-control  studies  is  described 
in  Chap.  7. 
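The role of the sampling scheme can be illustrated with a small numerical sketch. The counts below are hypothetical, and the sampling fractions are applied to expected counts rather than simulated. The sketch shows a standard fact that is often given as one reason why logistic regression is convenient for case-control data: the odds ratio is unchanged when cases and controls are sampled at different rates, whereas a naive risk ratio is not.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: exposed and unexposed cases; c, d: exposed and unexposed controls.
    """
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Proportion diseased among the exposed divided by that among the unexposed."""
    return (a / (a + c)) / (b / (b + d))


# Hypothetical population counts (illustrative only, not from the text).
pop = {"a": 200.0, "b": 100.0, "c": 9800.0, "d": 19900.0}

# Outcome-dependent sampling: keep every case, keep 3% of the controls.
f_case, f_control = 1.0, 0.03
cc = {
    "a": pop["a"] * f_case,
    "b": pop["b"] * f_case,
    "c": pop["c"] * f_control,
    "d": pop["d"] * f_control,
}

or_pop, or_cc = odds_ratio(**pop), odds_ratio(**cc)
rr_pop, rr_cc = risk_ratio(**pop), risk_ratio(**cc)

print(f"odds ratio: population {or_pop:.3f}, case-control {or_cc:.3f}")
print(f"risk ratio: population {rr_pop:.3f}, case-control {rr_cc:.3f}")
```

The sampling fractions cancel in the odds ratio but not in the risk ratio, which is why an analysis that ignores the design can be misleading.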


Missing  Data 

Measurements may be missing on the responses, which can lead to bias in estimation,
depending  on  the  reasons  for  the  absence.  It  is  clear  that  bias  will  arise  when  the 
probability  of  missingness  depends  on  the  size  of  the  response  that  would  have  been 
observed.  An  extreme  example  is  when  the  result  of  a  chemical  assay  is  reported 
as  “below  the  lower  limit  of  detection”;  such  a  variable  may  be  reported  as  the 
(known)  lower  limit,  or  as  a  zero,  and  analyzing  the  data  using  these  values  can 
lead  to  substantial  bias.  Removing  these  observations  will  also  lead  to  bias.  In 
the  analysis  of  individual-level  data  over  time  (to  give  so-called  longitudinal  data) 
another  common  mechanism  for  missing  observations  is  when  individuals  drop  out 
of  the  study. 
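The effect of a lower limit of detection can be seen in a short simulation. This is a sketch under stated assumptions: the "concentrations" are hypothetical lognormal draws, the limit of 0.5 is arbitrary, and the sample mean is the target of estimation. It compares the mean of the complete data (unobservable in practice) with the three treatments mentioned above: reporting the limit, reporting zero, and removing the observations.

```python
import math
import random

random.seed(1)   # fixed seed so the sketch is reproducible
lod = 0.5        # hypothetical lower limit of detection

# Hypothetical lognormal "true" concentrations.
y = [math.exp(random.gauss(0.0, 1.0)) for _ in range(20000)]


def mean(values):
    return sum(values) / len(values)


mean_full = mean(y)                                     # complete data
mean_lod = mean([max(v, lod) for v in y])               # censored values set to the limit
mean_zero = mean([v if v >= lod else 0.0 for v in y])   # censored values set to zero
mean_drop = mean([v for v in y if v >= lod])            # censored values removed

print(f"complete data        : {mean_full:.3f}")
print(f"limit substituted    : {mean_lod:.3f}")
print(f"zero substituted     : {mean_zero:.3f}")
print(f"observations removed : {mean_drop:.3f}")
```

Substituting zero pulls the mean down, while substituting the limit or removing the observations pushes it up; the ordering holds for any data set with at least one censored value.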


Aim  of  the  Analysis 

The  primary  aim  of  the  analysis  should  always  be  kept  in  mind;  in  particular,  is 
the  purpose  descriptive,  exploratory  (e.g.,  for  hypothesis  generation),  confirmatory 
(with  respect  to  an  a  priori  hypothesis),  or  predictive?  Regression  models  can  be 
used  for  each  of  these  endeavors,  but  the  manner  of  their  use  will  vary.  Large 
data sets can often be succinctly described using parsimonious² regression models.
Exploratory  studies  are  often  informal  in  nature,  and  many  different  models  may 
be  fitted  in  order  to  gain  insights  into  the  structure  of  the  data.  In  general,  however, 
great  care  must  be  taken  with  data  dredging  since  spurious  associations  may  be 
discovered  due  to  chance  alone. 

The  level  of  sophistication  of  the  analysis,  and  the  assumptions  required,  will 
vary  as  the  aims  and  abundance  of  data  differ.  For  example,  if  one  has  a  million 
observations  independently  sampled  from  a  population,  and  one  requires  inference 
for  the  mean  of  the  population,  then  inference  may  be  based  on  the  sample  mean 
and  sample  standard  deviation  alone,  without  recourse  to  more  sophisticated  models 
and approaches; we would expect such inference to be reliable, being based on few
assumptions. Similarly, inference is straightforward if we are interested in the average
response at an observed covariate value for which abundant data were recorded.


² The Oxford English Dictionary describes parsimony as “... that no more causes or forces should
be assumed than are necessary to account for the facts,” which serves our purposes, though care is
required in the use of the words “causes,” “forces,” and “facts.”

However, if such data are not available (e.g., when the number of covariates becomes
large  or  the  sample  size  is  small),  or  if  interpolation  is  required,  regression  models 
are  beneficial,  as  they  allow  the  totality  of  the  data  to  estimate  global  parameters 
and  smooth  across  unstructured  variability.  To  answer  many  statistical  questions, 
very  simple  approaches  will  often  suffice;  the  art  of  statistical  analysis  is  deciding 
upon  when  a  more  sophisticated  approach  is  necessary/warranted,  since  dependence 
on  assumptions  usually  increases  with  increasing  sophistication. 


1.3  Motivating  Examples 

We  now  introduce  a  number  of  examples  to  illustrate  different  data  collection 
procedures,  types  of  data,  and  study  aims.  We  highlight  the  distinguishing  features 
of  the  data  in  each  example  and  provide  a  signpost  to  the  chapter  in  which 
appropriate  methods  of  analysis  may  be  found. 

In general, data $\{Y_i, \boldsymbol{x}_i, i = 1, \ldots, n\}$ will be available on $n$ units, with $Y_i$
representing the univariate response variable and $\boldsymbol{x}_i = [1, x_{i1}, \ldots, x_{ik}]$ the row
vector  of  explanatory  variables  on  unit  i.  Variables  written  as  uppercase  letters  will 
represent  random  variables,  and  those  in  lowercase  fixed  quantities,  with  boldface 
representing  vectors  and  matrices. 


1.3.1  Prostate  Cancer 

We  describe  a  dataset  analyzed  by  Tibshirani  (1996)  and  originally  presented  by 
Stamey  et  al.  (1989).  The  data  were  collected  on  n  =  97  men  before  radical 
prostatectomy,  which  is  a  major  surgical  operation  that  removes  the  entire  prostate 
gland  along  with  some  surrounding  tissue.  We  take  as  response,  Y,  the  log  of 
prostate  specific  antigen  (PSA);  PSA  is  a  concentration  and  is  measured  in  ng/ml. 
In  Stamey  et  al.  (1989),  PSA  was  proposed  as  a  preoperative  marker  to  predict  the 
clinical  stage  of  cancer.  As  well  as  modeling  the  stage  of  cancer  as  a  function 
of  PSA,  the  authors  also  examined  PSA  as  a  function  of  age  and  seven  other 
histological  and  morphometric  covariates.  We  take  as  our  aim  the  building  of  a 
predictive  model  for  PSA,  using  the  eight  covariates: 

•  log(can  vol):  The  log  of  cancer  volume,  measured  in  milliliters  (cc).  The  area 
of  cancer  was  measured  from  digitized  images  and  multiplied  by  a  thickness  to 
produce  a  volume. 

•  log(weight):  The  log  of  the  prostate  weight,  measured  in  grams. 

•  Age:  The  age  of  the  patient,  in  years. 

•  log(BPH):  The  log  of  the  amount  of  benign  prostatic  hyperplasia  (BPH),  a 
noncancerous  enlargement  of  the  prostate  gland,  as  an  area  in  a  digitized  image 
and reported in cm².

• SVI: The seminal vesicle invasion, a 0/1 indicator of whether prostate cancer
cells  have  invaded  the  seminal  vesicle. 

•  log(cap  pen):  The  log  of  the  capsular  penetration,  which  represents  the  level  of 
extension  of  cancer  into  the  capsule  (the  fibrous  tissue  which  acts  as  an  outer 
lining  of  the  prostate  gland).  Measured  as  the  linear  extent  of  penetration,  in  cm. 

•  Gleason:  The  Gleason  score,  a  measure  of  the  degree  of  aggressiveness  of  the 
tumor.  The  Gleason  grading  system  assigns  a  grade  (1-5)  to  each  of  the  two 
largest  areas  of  cancer  in  the  tissue  samples  with  1  being  the  least  aggressive 
and  5  the  most  aggressive;  the  two  grades  are  then  added  together  to  produce  the 
Gleason  score. 

•  PGS45:  The  percentage  of  Gleason  scores  that  are  4  or  5. 

The  BPH  and  capsular  penetration  variables  originally  contained  zeros,  and  a 
small  number  was  substituted  before  the  log  transform  was  taken.  It  is  not  clear 
from  the  original  paper  why  the  log  transform  was  taken  though  PSA  varies  over  a 
wide  range,  and  so  linearity  of  the  mean  model  may  be  aided  by  the  log  transform. 
It  is  also  not  clear  why  the  variable  PGS45  was  constructed.  If  initial  analyses  were 
carried  out  to  find  variables  that  were  associated  with  PSA,  then  significance  levels 
of hypothesis tests will not be accurate (since they are not based on a priori
hypotheses but rather are the result of data dredging).

Carrying  out  exploratory  data  analysis  (EDA)  is  a  vital  step  in  any  data  analysis. 
Such  an  enterprise  includes  the  graphical  and  tabular  examination  of  variables,  the 
checking  of  the  data  for  errors  (for  example,  to  see  if  variables  are  within  their 
admissible  ranges),  and  the  identification  of  outlying  (unusual)  observations  or 
influential  observations  that  when  perturbed  lead  to  large  changes  in  inference.  This 
book  is  primarily  concerned  with  methods,  and  the  level  of  EDA  that  is  performed 
will  be  less  than  would  be  desirable  in  a  serious  data  analysis. 

Figure  1.1  displays  the  response  plotted  against  each  of  the  covariates  and 
indicates  a  number  of  associations.  The  association  between  Y  and  log(can  vol) 
appears  particularly  strong.  In  observational  settings  such  as  this,  there  are  often 
strong  dependencies  between  the  covariates.  We  may  investigate  these  dependencies 
using  scatterplots  (or  tables,  if  both  variables  are  discrete).  Figure  1.2  gives  an 
indication  of  the  dependencies  between  those  variables  that  exhibit  the  strongest 
associations;  log(can  vol)  is  strongly  associated  with  a  number  of  other  covariates. 
Consequently,  we  might  expect  that  adding  log(can  vol)  to  a  model  for  log(PSA)  that 
contains  other  covariates  will  change  the  estimated  associations  between  log(PSA) 
and  the  other  variables. 

Fig. 1.1 The response y = log(PSA) plotted versus each of the eight explanatory variables, x, in
the prostate cancer study, with local smoothers superimposed for continuous covariates

We define $Y_i$ as the log of PSA and $\boldsymbol{x}_i = [1, x_{i1}, \ldots, x_{i8}]$ as the $1 \times 9$ row
vector associated with patient $i$, $i = 1, \ldots, n = 97$. We may write a general mean
model as $\mathrm{E}[Y_i \mid \boldsymbol{x}_i] = f(\boldsymbol{x}_i, \boldsymbol{\beta})$, where $f(\cdot, \cdot)$ represents the functional form and
$\boldsymbol{\beta}$ unknown regression parameters. The most straightforward form is the multiple
linear regression

$$f(\boldsymbol{x}_i, \boldsymbol{\beta}) = \beta_0 + \sum_{j \in C} x_{ij} \beta_j, \tag{1.1}$$

where $C$ corresponds to the subset of elements of $\{1, 2, \ldots, 8\}$ whose associated
covariates we wish to include in the model and $\boldsymbol{\beta} = [\beta_0, \{\beta_j, j \in C\}]^{\mathrm{T}}$. The
interpretation of each of the coefficients $\beta_j$ depends crucially on knowing the scaling
and units of measurement of the associated variables $x_j$.
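The dependence of the coefficients in (1.1) on scaling and units can be made concrete with a minimal sketch. The function name `linear_mean`, the covariate values, and the coefficients are all hypothetical; only the form of the mean function is taken from the text. Measuring age in decades rather than years leaves the fitted mean unchanged only if the age coefficient is multiplied by 10.

```python
def linear_mean(x, beta0, beta):
    """Evaluate beta0 + sum over j in C of x[j] * beta[j].

    x    : dict mapping covariate index j to the value x_ij
    beta : dict mapping each j in C to beta_j (C is taken to be the keys of beta)
    """
    return beta0 + sum(x[j] * b for j, b in beta.items())


# Hypothetical patient: covariate 1 is log(can vol), covariate 3 is age.
x_years = {1: 1.35, 3: 64.0}       # age in years
beta_years = {1: 0.6, 3: 0.01}     # hypothetical coefficients
m_years = linear_mean(x_years, 1.5, beta_years)

# Same patient with age in decades: the age coefficient is 10 times larger.
x_decades = {1: 1.35, 3: 6.4}
beta_decades = {1: 0.6, 3: 0.1}
m_decades = linear_mean(x_decades, 1.5, beta_decades)

print(m_years, m_decades)
```

The two parameterizations describe the same mean, so a coefficient reported without its units cannot be interpreted.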

Most of the x variables in this study are measured with error (as is clear from
their derivation, e.g., log(BPH) is derived from a digitized image), and if we are
interested in estimating causal effects, then this aspect needs to be acknowledged
in the models that are fitted, since inference is affected in this situation, which is
known as errors-in-variables.

Fig. 1.2 Associations between selected explanatory variables in the prostate cancer study, with
local smoothers superimposed for continuous covariates

Distinguishing Features. Inference for multiple linear regression models is described
in Chap. 5, including a discussion of parameter interpretation. Chapter 4
discusses the difficult but important topics of model formulation and selection.

Table 1.1 Outcome after head injury as a function of four covariates: pupils, hematoma present,
coma score, and age

Pupils                       Good                  Poor
Hematoma present         No       Yes          No       Yes
Coma score             Low High  Low High    Low High  Low High
Age (years)  Outcome
1-25         Dead        9    5    5    7     58   11   32   12
             Alive      47   77   11   24     29   24   13   16
26-54        Dead       19    6   21   14     45    7   61   15
             Alive      15   44   18   38     11   16   11   21
≥55          Dead        7   12   19   25     20    7   42   17
             Alive       1    6    2   15      0    2    7    7

1.3.2  Outcome  After  Head  Injury 

Table  1.1  reports  data  presented  by  Titterington  et  al.  (1981)  in  a  study  initiated 
by  the  Institute  of  Neurological  Sciences  in  Glasgow.  These  data  were  collected 
prospectively  by  neurosurgeons  between  1968  and  1976.  The  original  aim  was  to 
predict  recovery  for  individual  patients  on  the  basis  of  data  collected  shortly  after 
the  injury.  The  data  that  we  consider  contain  information  on  a  binary  outcome, 
Y  =  0/1,  corresponding  to  dead/alive  after  head  injury,  and  the  covariates:  pupils 
(with  good  corresponding  to  a  reaction  to  light  and  poor  to  no  reaction),  coma 
score  (representing  depth  of  coma,  low  or  high),  hematoma  present  (no/yes),  and 
age (categorized as 1-25, 26-54, ≥55).

The  response  of  interest  here  is  p(x)  =  Pr(Y  =  1  |  x);  the  probability  that  a 
patient  with  covariates  x  is  alive.  This  quantity  must  lie  in  the  range  [0,1],  and  so,  at 
least  in  this  respect,  linear  models  are  unappealing.  To  illustrate,  suppose  we  have  a 
univariate  continuous  covariate  x  and  the  model 

$$p(x) = \beta_0 + \beta_1 x.$$

While  probabilities  not  close  to  zero  or  one  may  change  at  least  approximately 
linearly  with  x,  it  is  extremely  unlikely  that  this  behavior  will  extend  to  the  extremes, 
where  the  probability-covariate  relationship  must  flatten  out  in  order  to  remain 
in  the  correct  range.  An  additional,  important,  consideration  is  that  linear  models 
commonly  assume  that  the  variance  is  constant  and,  in  particular,  does  not  depend 
on  the  mean.  For  a  binary  outcome  with  probability  of  response  p(x),  the  Bernoulli 
variance is $p(x)[1 - p(x)]$ and so depends on the mean. As we will see, accurate
inference  depends  crucially  on  having  modeled  the  mean-variance  relationship 
appropriately. 

A  common  model  for  binary  data  is  the  logistic  regression  model,  in  which  the 
odds of survival, $p(x)/[1 - p(x)]$, is modeled as a function of $x$. For example, the
linear logistic regression model is

$$\frac{p(x)}{1 - p(x)} = \exp(\beta_0 + \beta_1 x).$$

This form is mathematically appealing, since the modeled probabilities are constrained
to lie within [0,1], though the interpretation of the parameters $\beta_0$ and $\beta_1$ is
not straightforward.
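The contrast between the two mean models can be checked numerically. The sketch below uses hypothetical coefficient values and a hypothetical grid of covariate values; it evaluates the linear form and the logistic form, and the Bernoulli variance p(x)[1 − p(x)] implied by the latter.

```python
import math


def linear_prob(x, b0, b1):
    """Linear model for the probability: b0 + b1 * x (not constrained to [0, 1])."""
    return b0 + b1 * x


def logistic_prob(x, b0, b1):
    """Solve p / (1 - p) = exp(b0 + b1 * x) for p."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def bernoulli_var(p):
    """Variance of a binary outcome with success probability p."""
    return p * (1.0 - p)


xs = [-4.0, -2.0, 0.0, 2.0, 4.0]           # hypothetical covariate values
lin = [linear_prob(x, 0.5, 0.2) for x in xs]     # hypothetical coefficients
logi = [logistic_prob(x, 0.0, 1.0) for x in xs]  # hypothetical coefficients

for x, pl, pg in zip(xs, lin, logi):
    print(f"x = {x:5.1f}  linear = {pl:6.3f}  logistic = {pg:6.3f}  "
          f"variance = {bernoulli_var(pg):.3f}")
```

With these values, the linear form leaves [0, 1] at both ends of the grid, while the logistic probabilities stay strictly inside it and their variance is largest near p(x) = 0.5.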

Distinguishing  Features.  Chapter  7  is  dedicated  to  the  modeling  of  binary  data.  In 
this  chapter,  logistic  regression  models  are  covered  in  detail,  along  with  alternatives. 
Formulating  predictive  models  and  assessing  the  predictive  power  of  such  models 
is  considered  in  Chaps.  10-12. 


1.3.3  Lung  Cancer  and  Radon 

We  now  describe  an  example  in  which  the  data  arise  from  a  spatial  ecological  study. 
In  an  ecological  study,  the  unit  of  analysis  is  the  group  rather  than  the  individual.  In 
spatial  epidemiological  studies,  due  primarily  to  reasons  of  confidentiality,  data  on 
disease,  population,  and  exposure  are  often  available  as  aggregates  across  area.  It  is 
these  areas  that  constitute  the  (ecological)  group  level  at  which  the  data  are  analyzed. 
In  this  example,  we  examine  the  association  between  lung  cancer  incidence  (over 
the  years  1998-2002)  and  residential  radon  at  the  level  of  the  county,  in  Minnesota. 
Radon  is  a  naturally  occurring  radioactive  gas  produced  by  the  breakdown  of 
uranium  in  soil,  rock,  and  water  and  is  a  known  carcinogen  for  lung  cancer  (Darby 
et  al.  2001).  However,  in  many  ecological  studies,  when  the  association  between 
lung  cancer  incidence  and  residential  radon  is  estimated,  radon  appears  protective. 
Ecological  bias  is  an  umbrella  term  that  refers  to  the  distortion  of  individual-level 
associations  due  to  the  process  of  aggregation.  There  are  many  facets  to  ecological 
bias  (Wakefield  2008),  but  an  important  issue  in  the  lung  cancer/radon  context  is  the 
lack  of  control  for  confounding,  a  primary  source  being  smoking. 

Let $Y_i$ denote the lung cancer incidence count and $x_i$ the average radon in county
$i$, $i = 1, \ldots, n = 87$. Age and gender are strongly associated with lung cancer
incidence, and a standard approach to controlling these factors is to form expected
counts $E_i = \sum_{j=1}^{J} N_{ij} q_j$, in which we multiply the population in stratum $j$ and
county $i$, $N_{ij}$, by a “reference” probability of lung cancer in stratum $j$, $q_j$, to obtain
the expected count in stratum $j$. Summing over all $J$ strata gives the total expected
count. Intuitively, these counts are what we would expect if the disease rates in
county  i  conform  with  the  reference.  A  summary  response  measure  in  county  i 
is  the  standardized  morbidity  ratio  (SMR),  given  by  Yi/Ei.  Counties  with  SMRs 
greater  than  1  have  an  excess  of  cases,  when  compared  to  that  expected. 
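The construction of expected counts and SMRs can be sketched in a few lines. All numbers below are hypothetical (two counties, three strata); only the formulas, the sum over strata of N_ij q_j and the ratio Y_i / E_i, come from the text.

```python
def expected_count(populations, reference_probs):
    """E_i = sum over strata j of N_ij * q_j for a single county i."""
    return sum(n * q for n, q in zip(populations, reference_probs))


# Hypothetical reference probabilities q_j for J = 3 age-gender strata.
q = [0.0001, 0.001, 0.005]

# Hypothetical stratum populations N_ij and observed counts Y_i.
N = {"county A": [40000, 30000, 10000], "county B": [20000, 10000, 2000]}
Y = {"county A": 95, "county B": 18}

E = {name: expected_count(pops, q) for name, pops in N.items()}
smr = {name: Y[name] / E[name] for name in N}

for name in N:
    print(f"{name}: expected = {E[name]:.1f}, SMR = {smr[name]:.2f}")
```

In this made-up example, county A has more cases than expected (SMR above 1) and county B fewer.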

Figure  1.3  maps  the  SMRs  in  counties  of  Minnesota,  and  we  observe  more 
than  twofold  variability  with  areas  of  high  incidence  in  the  northeast  of  the  state. 
Figure 1.4 maps the average radon by county, with low radon in the counties to the
northeast. This negative association is confirmed in Fig. 1.5 in which we plot the
SMRs versus average radon, with a smoother indicating the local trend.

Fig. 1.3 Standardized morbidity ratios for lung cancer in the period 1998-2002 by county in
Minnesota (map legend bins run from [0.58, 0.68) to [1.28, 1.38])


Fig. 1.4 Average radon (pCi/liter) by county in Minnesota

Fig. 1.5 Standardized morbidity ratios versus average radon (pCi/liter) by county in Minnesota


A simple model that constrains the mean to be positive is the loglinear regression

$$\log \mathrm{E}\left[ \frac{Y_i}{E_i} \mid x_i \right] = \beta_0 + \beta_1 x_i,$$

$i = 1, \ldots, n$. We might combine this form with a Poisson model for the counts.
However,  in  a  Poisson  model,  the  variance  is  constrained  to  equal  the  mean, 
which  is  often  too  restrictive  in  practice,  since  excess-Poisson  variability  is  often 
encountered.  Hence,  we  would  prefer  to  fit  a  more  flexible  model.  We  might  also  be 
concerned  with  residual  spatial  dependence  between  disease  counts  in  counties  that 
are  close  to  each  other.  Information  on  confounder  variables,  especially  smoking, 
would  also  be  desirable. 
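As a rough illustration of the loglinear Poisson model and of how excess-Poisson variability might be gauged, the sketch below fits log(mean) = log(E_i) + β0 + β1 x_i by Newton-Raphson and then computes the Pearson statistic divided by its degrees of freedom. The data are hypothetical, and the Pearson-based summary is a common diagnostic chosen here for illustration; it is not presented in the text at this point (the book's treatment is in Chap. 6). Values well above 1 would suggest overdispersion.

```python
import math

# Hypothetical county-level data: average radon, expected counts, observed counts.
x = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
E = [50.0, 80.0, 30.0, 120.0, 60.0, 40.0, 90.0, 70.0]
y = [60, 85, 33, 110, 50, 38, 70, 52]


def fitted_means(b0, b1):
    return [e * math.exp(b0 + b1 * xi) for e, xi in zip(E, x)]


def fit_poisson_loglinear(iterations=25):
    """Maximize the Poisson log-likelihood with offset log(E_i) by Newton-Raphson."""
    b0, b1 = math.log(sum(y) / sum(E)), 0.0
    for _ in range(iterations):
        mu = fitted_means(b0, b1)
        # Score vector.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: add (information)^(-1) times the score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


b0_hat, b1_hat = fit_poisson_loglinear()
mu_hat = fitted_means(b0_hat, b1_hat)
score0 = sum(yi - mi for yi, mi in zip(y, mu_hat))
score1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu_hat))
pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu_hat))
dispersion = pearson / (len(y) - 2)

print(f"beta0 = {b0_hat:.3f}, beta1 = {b1_hat:.3f}, dispersion = {dispersion:.2f}")
```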

Distinguishing  Features.  Poisson  regression  models  for  independent  data,  and 
extensions  to  allow  for  excess-Poisson  variation,  are  described  in  Chap.  6.  Such 
models  are  explicitly  designed  for  nonnegative  response  variables.  Accounting  for 
residual  spatial  dependence  is  considered  in  Chap.  9. 


1.3.4  Pharmacokinetic  Data 

Pharmacokinetics  is  the  study  of  the  time  course  of  a  drug  and  its  metabolites  after 
introduction  into  the  body.  A  typical  experiment  consists  of  a  known  dose  of  drug 
being  administered  via  a  particular  route  (e.g.,  orally  or  via  an  injection)  at  a  known 
time.  Subsequently,  blood  samples  are  taken,  and  the  concentration  of  the  drug  is 
measured. The data are in the form of $n$ pairs of points $[x_i, y_i]$, where $x_i$ denotes the
sampling time at which the $i$th blood sample is taken and $y_i$ denotes the $i$th measured
concentration,  i  =  1, . . . ,  n.  We  describe  in  some  detail  some  of  the  contextual 
scientific  background  in  order  to  motivate  a  particular  regression  model. 

A typical dataset, taken from Upton et al. (1982), is tabulated in Table 1.2 and
plotted in Fig. 1.6. These data were collected after a subject was given an oral dose
of 4.53 mg/kg of the antiasthmatic agent theophylline. The concentration of drug
was determined in subsequent blood samples using a chemical assay (a method for
determining the amount of a specific substance in a sample). Data were collected
over a period slightly greater than 24 h following drug administration.

Table 1.2 Concentration (y) of the drug theophylline as a function of time (x), obtained from a
subject who was administered an oral dose of size 4.53 mg/kg

Observation   Time       Concentration
number        (hours)    (mg/liter)
i             x_i        y_i
 1             0.27      4.40
 2             0.58      6.90
 3             1.02      8.20
 4             2.02      7.80
 5             3.62      7.50
 6             5.08      6.20
 7             7.07      5.30
 8             9.00      4.90
 9            12.15      3.70
10            24.17      1.05

Fig. 1.6 Concentration of theophylline plotted versus time for the data of Table 1.2

Pharmacokinetic  experiments  are  important  as  they  help  in  understanding  the 
absorption,  distribution,  and  elimination  processes  of  drugs.  Such  an  understanding 
provides  information  that  may  be  used  to  decide  upon  the  sizes  and  timings  of 
doses  that  should  be  administered  in  order  to  achieve  concentrations  falling  within 
a  desired  therapeutic  window.  Often  the  concentration  of  drug  acts  as  a  surrogate 
for  the  therapeutic  response.  The  aim  of  a  pharmacokinetic  trial  may  be  dose 
recommendation  for  a  specific  population,  for  example,  to  determine  a  dose  size 
for  the  packaging,  or  recommendations  for  a  particular  patient  based  on  covariates, 
which  is  known  as  individualization.  A  typical  question  is,  for  the  patient  who 
produced  the  data  in  Table  1.2,  what  dose  could  we  give  at  25  h  to  achieve  a 
concentration of 10 mg/l at 37 h?

Fig. 1.7 Representation of a one-compartment system with oral dosing. Concentrations are
measured in compartment 1


The processes determining drug concentrations are very complicated, but simple
compartmental models (e.g., Godfrey 1983) have been found to mimic the
concentrations  observed  in  patients.  The  basic  idea  is  to  model  the  body  as  a 
system  of  compartments  within  each  of  which  the  kinetics  of  the  drug  flow  is 
assumed  to  be  similar.  We  consider  the  simplest  possible  model  for  modeling 
drug  concentrations  following  the  administration  of  an  oral  dose.  The  model  is 
represented  in  Fig.  1.7  and  assumes  that  the  body  consists  of  a  compartment  into 
which  the  drug  is  introduced  and  from  which  absorption  occurs  into  a  second  “blood 
compartment.” The compartments are labeled respectively as 0 and 1 in Fig. 1.7.
Subsequently,  elimination  from  compartment  1  occurs  with  blood  samples  taken 
from  this  compartment. 

We now describe in some detail the one-compartment model with first-order
absorption and elimination. Let $w_k(t)$ represent the amount of drug in compartment
$k$ at time $t$, $k = 0, 1$. The drug flow between the compartments is described by the
differential equations

$$\frac{dw_0}{dt} = -k_a w_0, \tag{1.2}$$

$$\frac{dw_1}{dt} = k_a w_0 - k_e w_1, \tag{1.3}$$

where $k_a > 0$ is the absorption rate constant associated with the flow from
compartment 0 to compartment 1 and $k_e > 0$ is the elimination rate constant
(see Fig. 1.7). At time zero, the initial dose is $w_0(0) = D$, and solving the pair
of differential equations (1.2) and (1.3), subject to this condition, gives the amount
of drug in the body at time $x$ as

$$w_1(x) = \frac{D k_a}{k_a - k_e} \left[ \exp(-k_e x) - \exp(-k_a x) \right]. \tag{1.4}$$

We do not measure the amount of total drug but drug concentration, and so we
need to normalize (1.4) by dividing $w_1(x)$ by the volume $V > 0$ of the blood
compartment to give

$$\mu(x) = \frac{D k_a}{V (k_a - k_e)} \left[ \exp(-k_e x) - \exp(-k_a x) \right], \tag{1.5}$$

so that $\mu(x)$ is the drug concentration in the blood compartment at time $x$.
Equation (1.5) describes a model that is nonlinear in the parameters $V$, $k_a$, and $k_e$;
for reasons that will be examined in detail in Chap. 6, inference for such models is
more difficult than for their linear counterparts.
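The closed form (1.4) can be checked against a direct numerical solution of (1.2) and (1.3). The sketch below is only a check, not a fitting procedure. The rate constants are hypothetical values chosen for illustration, the dose is the 4.53 of Table 1.2, and the blood compartment is assumed to be empty at time zero, that is, w1(0) = 0, an assumption that the closed form implies but the text does not state explicitly. A classical fourth-order Runge-Kutta scheme is used so that no external library is needed.

```python
import math


def w1_closed(x, D, ka, ke):
    """Amount in compartment 1 at time x according to (1.4)."""
    return D * ka / (ka - ke) * (math.exp(-ke * x) - math.exp(-ka * x))


def w1_rk4(x_end, D, ka, ke, steps=2000):
    """Integrate (1.2)-(1.3) from w0(0) = D, w1(0) = 0 (assumed) with classical RK4."""
    def rhs(w0, w1):
        return -ka * w0, ka * w0 - ke * w1

    h = x_end / steps
    w0, w1 = D, 0.0
    for _ in range(steps):
        a0, a1 = rhs(w0, w1)
        b0, b1 = rhs(w0 + 0.5 * h * a0, w1 + 0.5 * h * a1)
        c0, c1 = rhs(w0 + 0.5 * h * b0, w1 + 0.5 * h * b1)
        d0, d1 = rhs(w0 + h * c0, w1 + h * c1)
        w0 += h * (a0 + 2.0 * b0 + 2.0 * c0 + d0) / 6.0
        w1 += h * (a1 + 2.0 * b1 + 2.0 * c1 + d1) / 6.0
    return w1


D, ka, ke = 4.53, 1.5, 0.1   # ka and ke are hypothetical illustration values
gap = abs(w1_rk4(5.0, D, ka, ke) - w1_closed(5.0, D, ka, ke))
print(f"difference between numerical and closed-form solutions at x = 5: {gap:.2e}")
```

Dividing `w1_closed` by a volume V gives the concentration curve (1.5).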

We have so far ignored the stochastic element of the model. An obvious error
model is

$$y_i = \mu(x_i) + \epsilon_i,$$

with $\mathrm{E}[\epsilon_i] = 0$, $\mathrm{var}(\epsilon_i) = \sigma_\epsilon^2$, $i = 1, \ldots, n$, and $\mathrm{cov}(\epsilon_i, \epsilon_j) = 0$, $i \neq j$. We may
go one stage further and assume $\epsilon_i \mid \sigma_\epsilon^2 \sim_{iid} \mathrm{N}(0, \sigma_\epsilon^2)$, where $\sim_{iid}$ is shorthand
for “is independent and identically distributed as.” There are a number of potential
difficulties with this error model, beyond the distributional choice of normality.
Concentrations must be nonnegative, and so we might expect the magnitude of
errors to decrease with decreasing “true” concentration $\mu(x)$, a phenomenon that
is  often  confirmed  by  examination  of  assay  validation  data.  The  error  terms  are 
likely  to  reflect  not  only  assay  precision,  however,  but  also  model  misspecification, 
and  given  the  simple  one-compartment  system  we  have  assumed,  this  could  be 
substantial.  We  might  therefore  expect  the  error  terms  to  display  correlation  across 
time.  In  this  example,  the  scientific  context  therefore  provides  not  only  a  mean 
function  but  also  information  on  how  the  variance  of  the  data  changes  with  the 
mean. 

One simple solution, to at least some of these difficulties, is to take the logarithm
of (1.5) and fit the model:

$$\log y_i = \log \mu(x_i) + \delta_i.$$

We may further assume $\mathrm{E}[\delta_i] = 0$, $\mathrm{var}(\delta_i) = \sigma_\delta^2$, $i = 1, \ldots, n$, and $\mathrm{cov}(\delta_i, \delta_j) = 0$,
$i \neq j$. Multiplicative errors on the original scale and additive errors on the log scale
give

$$\mathrm{var}(Y) = \mu(x)^2 \, \mathrm{var}(e^{\delta}) \approx \mu(x)^2 \sigma_\delta^2$$

for small $\delta$.
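The variance approximation above can be motivated by a first-order (delta method) expansion. In the sketch below, $\sigma_\delta^2$ denotes $\mathrm{var}(\delta)$. The final line additionally assumes that $\delta$ is normally distributed, which is an assumption made here for illustration and is not imposed by the text at this point.

```latex
\begin{align*}
Y &= \mu(x)\, e^{\delta}, \qquad \mathrm{E}[\delta] = 0, \quad \mathrm{var}(\delta) = \sigma_\delta^2, \\
e^{\delta} &\approx 1 + \delta
  \;\Longrightarrow\;
  \mathrm{var}(e^{\delta}) \approx \mathrm{var}(\delta) = \sigma_\delta^2, \\
\mathrm{var}(Y) &= \mu(x)^2\, \mathrm{var}(e^{\delta}) \approx \mu(x)^2 \sigma_\delta^2 . \\
\intertext{If, in addition, $\delta \sim \mathrm{N}(0, \sigma_\delta^2)$, then exactly}
\mathrm{var}(e^{\delta}) &= e^{\sigma_\delta^2}\left( e^{\sigma_\delta^2} - 1 \right)
  = \sigma_\delta^2 + \tfrac{3}{2}\sigma_\delta^4 + O(\sigma_\delta^6).
\end{align*}
```

Under this model, the standard deviation on the original scale is approximately proportional to the mean, so the coefficient of variation is approximately constant.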

There are two other issues that are relevant to modeling in this example. The first
is that in pharmacokinetic analyses, interest often focuses on derived parameters
of interest, which are functions of $[V, k_a, k_e]$. In particular, we may wish to make
inference for the time to maximum concentration, the maximum concentration, the
clearance (initial dose divided by the area under the concentration curve), and the
elimination half-life, which are given by

$$t_{\max} = \frac{1}{k_a - k_e} \log\left( \frac{k_a}{k_e} \right),$$

$$c_{\max} = \mu(t_{\max}) = \frac{D}{V} \left( \frac{k_a}{k_e} \right)^{-k_e/(k_a - k_e)},$$

$$\mathrm{Cl} = V \times k_e,$$

$$t_{1/2} = \frac{\log 2}{k_e}.$$

A second issue is that model (1.5) is unidentifiable in the sense that the parameters
$[V, k_a, k_e]$ give the same curve as the parameters $[V k_e/k_a, k_e, k_a]$. This identifiability
problem can be overcome via a restriction such as constraining the absorption
rate to exceed the elimination rate, $k_a > k_e > 0$, though this complicates inference.

Often the data available for individualization will be sparse. For example,
suppose we only observed the first two observations in Table 1.2. In this situation,
inference is impossible without additional information (since there are more
parameters than data points), which suggests a Bayesian approach in which prior
information on the unknown parameters is incorporated into the analysis.

Distinguishing Features. Model (1.5) is nonlinear in the parameters. Such models
will be considered in Chap. 6, including their use in situations in which additional
information on the parameters is incorporated via the specification of a prior
distribution. The data in Table 1.2 are from a single subject. In the original study,
data were available for 12 subjects, and ideally we would like to analyze the
totality of data; hierarchical models provide one framework for such an analysis.
Hierarchical nonlinear models are considered in Chap. 9.
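The derived pharmacokinetic parameters and the identifiability issue of this example can both be checked numerically. In the sketch below, the values of V, ka, and ke are hypothetical (they are not estimates from Table 1.2) and the function names are illustrative. It evaluates the concentration curve (1.5), computes the time to maximum concentration, the maximum concentration, the clearance, and the elimination half-life, and confirms that swapping the two rate constants while rescaling the volume by ke/ka reproduces the same curve.

```python
import math


def mu(x, D, V, ka, ke):
    """Concentration at time x under model (1.5)."""
    return D * ka / (V * (ka - ke)) * (math.exp(-ke * x) - math.exp(-ka * x))


def derived(D, V, ka, ke):
    """Time to maximum, maximum concentration, clearance, and elimination half-life."""
    return {
        "t_max": math.log(ka / ke) / (ka - ke),
        "c_max": (D / V) * (ka / ke) ** (-ke / (ka - ke)),
        "Cl": V * ke,
        "t_half": math.log(2.0) / ke,
    }


D = 4.53                      # dose from Table 1.2
V, ka, ke = 0.5, 1.5, 0.1     # hypothetical parameter values
d = derived(D, V, ka, ke)

# The closed-form maximum should agree with the curve evaluated at t_max.
peak_gap = abs(mu(d["t_max"], D, V, ka, ke) - d["c_max"])

# Identifiability: [V, ka, ke] and [V * ke / ka, ke, ka] give the same curve.
times = [0.27, 1.02, 5.08, 12.15, 24.17]
flip_gap = max(
    abs(mu(t, D, V, ka, ke) - mu(t, D, V * ke / ka, ke, ka)) for t in times
)

print({name: round(value, 3) for name, value in d.items()})
print(f"peak check: {peak_gap:.2e}, largest curve difference: {flip_gap:.2e}")
```

Because the two parameter sets cannot be distinguished from concentration data alone, a constraint such as ka > ke is needed before the individual parameters can be interpreted.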


1.3.5  Dental  Growth 

Table 1.3 gives dental measurements of the distance in millimeters from the center
of the pituitary gland to the pterygomaxillary fissure in 11 girls and 16 boys recorded
at the ages of 8, 10, 12, and 14 years. These data were originally analyzed in Potthoff
and Roy (1964).

Figure 1.8 plots these data, and we see that dental growth for each child increases
in  an  approximately  linear  fashion.  Three  inferential  situations  are: 

1.  Summarization.  For  each  of  the  boy  and  girl  populations,  estimate  the  mean  and 
standard  deviation  of  pituitary  gland  measurements  at  each  of  the  four  ages. 

2.  Population  inference.  For  each  of  the  populations  of  boys  and  girls  from  which 
these  data  were  sampled,  estimate  the  average  linear  growth  over  the  age  range 
8-14  years.  Additionally,  estimate  the  average  dental  distance,  with  an  associated 
interval  estimate,  at  an  age  of  9  years. 

3.  Individual  inference.  For  a  specific  boy  or  girl  in  the  study,  estimate  the  rate 
of  growth  over  the  age  range  8-14  years  and  predict  the  growth  at  15  years. 
Additionally,  for  an  unobserved  girl,  from  the  same  population  that  produced  the 
sampled girls, obtain a predictive growth curve, along with an interval envelope.

Table 1.3 Dental growth data for boys and girls

              Age (years)                         Age (years)
Girl      8     10     12     14    Boy       8     10     12     14
1      21.0   20.0   21.5   23.0    1      26.0   25.0   29.0   31.0
2      21.0   21.5   24.0   25.5    2      21.5   22.5   23.0   26.5
3      20.5   24.0   24.5   26.0    3      23.0   22.5   24.0   27.5
4      23.5   24.5   25.0   26.5    4      25.5   27.5   26.5   27.0
5      21.5   23.0   22.5   23.5    5      20.0   23.5   22.5   26.0
6      20.0   21.0   21.0   22.5    6      24.5   25.5   27.0   28.5
7      21.5   22.5   23.0   25.0    7      22.0   22.0   24.5   26.5
8      23.0   23.0   23.5   24.0    8      24.0   21.5   24.5   25.5
9      20.0   21.0   22.0   21.5    9      23.0   20.5   31.0   26.0
10     16.5   19.0   19.0   19.5    10     27.5   28.0   31.0   31.5
11     24.5   25.0   28.0   28.0    11     23.0   23.0   23.5   25.0
                                    12     21.5   23.5   24.0   28.0
                                    13     17.0   24.5   26.0   29.5
                                    14     22.5   25.5   25.5   26.0
                                    15     23.0   24.5   26.0   30.0
                                    16     22.0   21.5   23.5   25.0

Fig. 1.8 Dental growth data for boys and girls: distance plotted versus age (years)


With  16  boys  and  11  girls,  inference  for  situation  1  can  be  achieved  by  simply 
evaluating  the  sample  mean  and  standard  deviation  at  each  time  point;  these 
quantities  are  given  in  Table  1.4.  These  simple  summaries  are  straightforward  to 
construct  and  are  based  on  independence  of  individuals.  To  obtain  interval  estimates 
for  the  means  and  standard  deviations,  one  must  be  prepared  to  make  assumptions 
(such  as  approximate  normality  of  the  measurements),  since  for  these  data  the 
sample  sizes  are  not  large  and  we  might  be  wary  of  appealing  to  large  sample 
(asymptotic)  arguments. 

Table 1.4 Sample means and standard deviations (SDs) for girls and boys, by age group

Age       Girls                   Boys
(years)   Mean (mm)   SD (mm)     Mean (mm)   SD (mm)
8         21.2        2.1         22.9        2.5
10        22.2        1.9         23.8        2.1
12        23.1        2.4         25.7        2.7
14        24.1        2.4         27.5        2.1
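
The summaries in Table 1.4 can be reproduced directly from the measurements in Table 1.3. The following is a minimal sketch in Python with NumPy (our choice of tool, not one used in the text); the array names and the helper `summarize` are illustrative, and the standard deviation is assumed to use the usual n - 1 divisor.

```python
import numpy as np

ages = [8, 10, 12, 14]

# Rows are children, columns are ages 8, 10, 12, 14 (mm, from Table 1.3).
girls = np.array([
    [21.0, 20.0, 21.5, 23.0], [21.0, 21.5, 24.0, 25.5],
    [20.5, 24.0, 24.5, 26.0], [23.5, 24.5, 25.0, 26.5],
    [21.5, 23.0, 22.5, 23.5], [20.0, 21.0, 21.0, 22.5],
    [21.5, 22.5, 23.0, 25.0], [23.0, 23.0, 23.5, 24.0],
    [20.0, 21.0, 22.0, 21.5], [16.5, 19.0, 19.0, 19.5],
    [24.5, 25.0, 28.0, 28.0],
])
boys = np.array([
    [26.0, 25.0, 29.0, 31.0], [21.5, 22.5, 23.0, 26.5],
    [23.0, 22.5, 24.0, 27.5], [25.5, 27.5, 26.5, 27.0],
    [20.0, 23.5, 22.5, 26.0], [24.5, 25.5, 27.0, 28.5],
    [22.0, 22.0, 24.5, 26.5], [24.0, 21.5, 24.5, 25.5],
    [23.0, 20.5, 31.0, 26.0], [27.5, 28.0, 31.0, 31.5],
    [23.0, 23.0, 23.5, 25.0], [21.5, 23.5, 24.0, 28.0],
    [17.0, 24.5, 26.0, 29.5], [22.5, 25.5, 25.5, 26.0],
    [23.0, 24.5, 26.0, 30.0], [22.0, 21.5, 23.5, 25.0],
])


def summarize(x):
    """Return (means, SDs) at each age; the SD uses the n - 1 divisor."""
    return x.mean(axis=0), x.std(axis=0, ddof=1)


girl_mean, girl_sd = summarize(girls)
boy_mean, boy_sd = summarize(boys)

print("age  girls: mean  sd   boys: mean  sd")
for k, age in enumerate(ages):
    print(f"{age:>3d}  {girl_mean[k]:11.1f} {girl_sd[k]:4.1f} "
          f"{boy_mean[k]:11.1f} {boy_sd[k]:4.1f}")
```

Each column is summarized separately, so the calculation uses only independence between children and makes no use of the within-child ordering of the measurements.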

For  situation  2,  we  may  fit  a  linear  model  relating  distance  to  age.  Since  there 
are  no  data  at  9  years,  to  obtain  an  estimate  of  the  dental  distance,  we  again  require 
a  model  relating  distance  to  age.  In  situation  3,  we  may  wish  to  use  the  totality  of 
data  as  an  aid  to  providing  inference  for  a  specific  child.  For  a  new  girl  from  the 
same  population,  we  clearly  need  to  use  the  existing  data  and  a  model  describing 
between-girl  differences. 

For  longitudinal  (repeated  measures)  data  such  as  these,  we  cannot  simply  fit 
models  to  the  totality  of  the  data  on  boys  or  girls  and  assume  independence  of 
measurements;  we  need  to  adjust  for  the  correlation  between  measurements  on  the 
same  child.  There  is  clearly  dependence  between  such  measurements.  For  example, 
boy  10  has  consistently  higher  measurements  than  the  majority  of  boys.  There  are 
two  distinct  approaches  to  modeling  longitudinal  data.  In  the  marginal  approach, 
the  average  response  is  modeled  as  a  function  of  covariates  (including  time), 
and  standard  errors  are  empirically  adjusted  for  dependence.  In  the  conditional 
approach,  the  response  of  each  individual  is  modeled  as  a  function  of  individual- 
specific  parameters  that  are  assumed  to  arise  from  a  distribution,  so  that  the  overall 
variability  is  partitioned  into  within-  and  between-child  components.  The  marginal 
approach  is  designed  for  answering  population-level  questions  (as  posed  in  situation 
2)  based  on  minimal  assumptions.  Conditional  approaches  can  answer  a  greater 
number  of  inferential  questions  but  require  an  increased  number  of  assumptions 
which  decreases  their  robustness  to  model  misspecification. 
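
To make the marginal approach concrete, the following sketch fits a straight line to the girls' data of Table 1.3 by pooled least squares (a working assumption of independence) and then adjusts the standard errors empirically by grouping residuals within child. This is one simple version of the idea, written in Python with NumPy as an assumption of ours; the variable names are illustrative, and with only 11 girls the normal-quantile interval should be read as a rough approximation.

```python
import numpy as np

ages = np.array([8.0, 10.0, 12.0, 14.0])
girls = np.array([                       # Table 1.3, girls, distance in mm
    [21.0, 20.0, 21.5, 23.0], [21.0, 21.5, 24.0, 25.5],
    [20.5, 24.0, 24.5, 26.0], [23.5, 24.5, 25.0, 26.5],
    [21.5, 23.0, 22.5, 23.5], [20.0, 21.0, 21.0, 22.5],
    [21.5, 22.5, 23.0, 25.0], [23.0, 23.0, 23.5, 24.0],
    [20.0, 21.0, 22.0, 21.5], [16.5, 19.0, 19.0, 19.5],
    [24.5, 25.0, 28.0, 28.0],
])
n, T = girls.shape
X = np.column_stack([np.ones(T), ages])  # same design matrix for every girl

# Pooled least squares over all n * T measurements.
XtX = n * (X.T @ X)
beta = np.linalg.solve(XtX, X.T @ girls.sum(axis=0))

# Empirical (sandwich) variance: one score contribution per child, so that
# correlation between measurements on the same child is reflected.
bread = np.linalg.inv(XtX)
meat = np.zeros((2, 2))
for y in girls:
    score = X.T @ (y - X @ beta)
    meat += np.outer(score, score)
cov_robust = bread @ meat @ bread

# Naive variance that treats all n * T measurements as independent.
resid = girls - X @ beta
sigma2 = (resid ** 2).sum() / (n * T - 2)
cov_naive = sigma2 * bread

# Average distance at 9 years, an age at which no data were collected.
x9 = np.array([1.0, 9.0])
fit9 = x9 @ beta
se9_robust = np.sqrt(x9 @ cov_robust @ x9)
se9_naive = np.sqrt(x9 @ cov_naive @ x9)
ci9 = (fit9 - 1.96 * se9_robust, fit9 + 1.96 * se9_robust)

print("intercept, slope:", beta)
print("estimate at age 9:", fit9, "interval:", ci9)
print("naive SE:", se9_naive, "robust SE:", se9_robust)
```

Because every girl is measured at the same four ages, the pooled estimate coincides with the average of the 11 individual least squares fits; only the standard errors change when the dependence is acknowledged.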

Distinguishing  Features.  Chapter  8  describes  linear  models  for  dependent  data  such 
as  these. 


1.3.6  Spinal  Bone  Mineral  Density 

Bachrach  et  al.  (1999)  analyze  longitudinal  data  on  spinal  bone  mineral  density 
(SBMD)  measurements  on  230  women  aged  between  8  and  27  years  and  of  one  of 
four  ethnic  groups:  Asian,  Black,  Hispanic,  and  White.  The  aim  of  this  study  was  to 
examine  ethnic  differences  in  SBMD. 

Figure 1.9 displays the SBMD measurements by individual, with one panel for each of the four races. The relationship between SBMD and age is clearly nonlinear, and there are also woman-specific differences in overall level so that observations on the same woman are correlated. Letting $Y_{ij}$ represent the SBMD measurement on woman $i$ at age $\text{age}_{ij}$, we might propose a mean model of the form

$$E[Y_{ij} \mid \text{age}_{ij}] = x_i \beta + f(\text{age}_{ij}) + b_i,$$

where $x_i$ is a 1 x 4 row vector with a single one and three zeroes that represents the ethnicity of woman $i$ (coded in the order Hispanic, White, Asian, Black), with $\beta = [\beta_H, \beta_W, \beta_A, \beta_B]^T$ the 4 x 1 vector of associated regression coefficients, $f(\text{age}_{ij})$ is a function that varies smoothly with age, and $b_i$ is a woman-specific intercept which is included to account for dependencies of measurements on the same individual. The relationship between SBMD and age is not linear and not of primary interest. Consequently, we would like to use a flexible model form, and we may not be concerned if this model does not contain easily interpretable parameters. Nonparametric regression is the term we use to refer to flexible mean modeling.

Fig. 1.9 Spinal bone mineral density measurements as a function of age (years) and ethnicity. Points that are connected represent measurements from the same woman

Distinguishing Features. The analysis of these data requires both a flexible mean model for the age effect and acknowledgement of the dependence of measurements on the same woman. Chapters 10-12 describe models that allow for these possibilities.


1.4  Nature  of  Randomness 

Regression models consist of both deterministic and stochastic (random) components, and a consideration of the sources of the randomness is worthwhile, both to interpret parameters contained in the deterministic component and to model the stochastic component. We initially consider an idealized situation in which a response is completely deterministic, given sufficient information, and randomness is only induced by missing information.3 Let $y$ denote a variable with values $y_1, \ldots, y_N$ within a population. We begin with a very simple deterministic model

$$y_i = \beta_0 + \beta_1 x_i + \gamma z_i \qquad (1.6)$$

for $i = 1, \ldots, N$, so that, given $x_i$ and $z_i$ (and knowing $\beta_0$, $\beta_1$, and $\gamma$), $y_i$ is completely determined. Suppose we only measure $y_i$ and $x_i$ and assume the model

$$Y_i = \beta_0^* + \beta_1^* x_i + \epsilon_i.$$

To interpret $\beta_0^*$ and $\beta_1^*$, we need to understand the relationship between $x_i$ and $z_i$, $i = 1, \ldots, N$. To this end, write

$$z_i = a + b x_i + \delta_i, \qquad (1.7)$$

$i = 1, \ldots, N$. This form does not in any sense assume that a linear association is appropriate or "correct"; rather, it is the linear approximation to $E[Z \mid x]$. In (1.7), we may take $a$ and $b$ as the least squares estimates from fitting a linear model to the data $[x_i, z_i]$, $i = 1, \ldots, N$. Substitution of (1.7) into (1.6) yields

$$y_i = \beta_0 + \beta_1 x_i + \gamma (a + b x_i + \delta_i) = \beta_0^* + \beta_1^* x_i + \epsilon_i,$$

where

$$\beta_0^* = \beta_0 + a\gamma, \qquad \beta_1^* = \beta_1 + b\gamma, \qquad \epsilon_i = \gamma \delta_i, \quad i = 1, \ldots, N, \qquad (1.8)$$

so that $\beta_1^*$ is a combination of the direct effect of $x_i$ on $y_i$ and the effect of $z_i$, through the linear association between $z_i$ and $x_i$. This development illustrates the problems in nonrandomized situations of estimating the causal effect of $x_i$ on $y_i$, that is, $\beta_1$. Turning to the stochastic component, (1.8) illustrates that properties of $\epsilon_i$ are inherited from $\delta_i$. Hence, assumptions such as constancy of variance of $\epsilon_i$ depend on the nature of $z_i$ and, in particular, on the joint distribution of $x_i$ and $z_i$.

3 When simulations are performed, pseudorandom numbers are generated via deterministic sequences. For example, consider the sequence generated by the congruential generator

$$X_i = a X_{i-1} \bmod m$$

along with initial value (or "seed") $X_0$. Then $X_i$ takes values in $0, 1, \ldots, m-1$, and pseudorandom numbers are obtained as $U_i = X_i / m$, where $X_0$, $a$, and $m$ are chosen so that the $U_i$'s have (approximately) the properties of uniform U(0, 1) random variables. However, if $X_0$, $a$, and $m$ are known, the randomness disappears! Ripley (1987, Chap. 2) provides a discussion of pseudorandom variable generation and specifically "good" choices of $a$ and $m$.
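
The congruential generator of footnote 3 is short enough to write out in full. The sketch below is in Python; the function name `congruential` and the seed are illustrative, and the constants a = 16807 and m = 2^31 - 1 are one well-known choice (often called the "minimal standard" generator) that we have picked for illustration rather than taken from the text.

```python
def congruential(seed, a, m, size):
    """Return U_1, ..., U_size, where X_i = a * X_{i-1} mod m and U_i = X_i / m."""
    x = seed
    out = []
    for _ in range(size):
        x = (a * x) % m
        out.append(x / m)
    return out


A, M = 16807, 2**31 - 1   # illustrative constants, not taken from the text

u_first = congruential(seed=12345, a=A, m=M, size=5)
u_again = congruential(seed=12345, a=A, m=M, size=5)

print(u_first)
print(u_first == u_again)   # the same seed, a, and m give the same sequence
```

Rerunning with the same seed, multiplier, and modulus reproduces the sequence exactly, which is the sense in which the randomness disappears once these quantities are known.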

Increasing slightly the realism, we extend the original deterministic model to

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{k=1}^{q} \gamma_k z_{ik}. \qquad (1.9)$$

Suppose we only measure $x_{i1}, \ldots, x_{ip}$ and assume the simple model

$$Y_i = \beta_0^* + \sum_{j=1}^{p} \beta_j^* x_{ij} + \epsilon_i, \qquad (1.10)$$

where the errors, $\epsilon_i$, now correspond to the totality of scaled versions of the $z_{ik}$'s that remain after extracting the linear associations with the $x_{ij}$'s, by analogy with (1.7) and (1.8).

Viewing  the  error  terms  as  sums  of  random  variables  and  considering  the  central 
limit  theorem  (Appendix  G)  naturally  leads  to  the  normal  distribution  as  a  plausible 
error  distribution.  There  is  no  compelling  reason  to  believe  that  the  variance  of  this 
normal  distribution  will  be  constant  across  the  space  of  the  x  variables,  however. 

We have distinguished between the regression coefficients in the assumed model (1.10), denoted by $\beta_j^*$, and those in the original model (1.9), denoted $\beta_j$. In general, $\beta_j^* \neq \beta_j$, because of the possible effects of confounding, which occurs due to dependencies between $x_{ij}$ and elements of $z_i = [z_{i1}, \ldots, z_{iq}]$. In the example just considered, only if $x_{ij}$ is linearly independent of the $z_{ik}$ will the coefficients $\beta_j$ and $\beta_j^*$ coincide. For nonlinear models, the relationship between the two sets of coefficients is even more complex.
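
The relationships in (1.8) can be checked numerically for the single-covariate model (1.6). The sketch below, in Python with NumPy, generates a hypothetical population in which z is associated with x, constructs y deterministically, and then regresses y on x alone. All numerical values (the coefficients, the population size, and the form of the association between x and z) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
beta0, beta1, gamma = 1.0, 2.0, 3.0      # hypothetical values for model (1.6)

x = rng.normal(size=N)
z = 0.5 * x + rng.normal(size=N)         # z is associated with x
y = beta0 + beta1 * x + gamma * z        # fully determined by x and z


def least_squares(x, y):
    """Intercept and slope from the least squares fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]


a, b = least_squares(x, z)                     # linear approximation (1.7)
beta0_star, beta1_star = least_squares(x, y)   # model with z unmeasured

delta = z - (a + b * x)
eps = y - (beta0_star + beta1_star * x)

print("beta1* =", beta1_star, " beta1 + b*gamma =", beta1 + b * gamma)
print("beta0* =", beta0_star, " beta0 + a*gamma =", beta0 + a * gamma)
print("max |eps - gamma*delta| =", np.max(np.abs(eps - gamma * delta)))
```

Since least squares is linear in the response, the agreement is exact up to rounding error rather than approximate: the slope from the reduced model differs from $\beta_1$ by exactly $b\gamma$, and the residuals are the scaled residuals $\gamma\delta_i$.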

This  development  illustrates  that  an  aim  of  regression  modeling  is  often  to 
“explain”  the  error  terms  using  observed  covariates.  In  general,  error  terms  represent 
not  only  unmeasured  variables  but  also  data  anomalies,  such  as  inaccurate  recording 
of  responses  and  covariates,  and  model  misspecification.  Clearly  the  nature  of  the 
randomness,  and  the  probabilities  we  attach  to  different  events,  is  conditional  upon 
the  information  that  we  have  available  and,  specifically,  the  variables  we  measure. 

Similar  considerations  can  be  given  to  other  types  of  random  variables.  For 
example,  suppose  we  wish  to  model  a  binary  random  variable  Y  taking  values  coded 
as  0  and  1.  Sometimes  it  will  be  possible  to  link  Y  to  an  underlying  continuous 
latent  variable  and  use  similar  arguments  to  that  above.  To  illustrate,  Y  could  be  an 
indicator  of  low  birth  weight  and  is  a  simple  function  of  the  true  birth  weight,  U, 
which  is  itself  associated  with  many  covariates.  We  may  then  model  the  probability 
of low birth weight as a function of covariates $x$, via

$$p(x) = \Pr(Y = 1 \mid x) = \Pr(U < u_0 \mid x) = E[Y \mid x],$$

where $u_0$ is the threshold value that determines whether a child is classified as low birth weight or not. This development is taken further in Sects. 7.6.1 and 9.13.
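
As a rough illustration of the latent variable argument, suppose, purely as an assumption for this sketch, that the birth weight U is normally distributed with a mean that is linear in a single binary covariate x. Then Pr(Y = 1 | x) is a normal distribution function evaluated at the standardized threshold. The Python code below compares this with the proportion of simulated births falling below the threshold; the threshold of 2500 g and all parameter values are hypothetical.

```python
import numpy as np
from math import erf, sqrt


def norm_cdf(v):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))


rng = np.random.default_rng(1)
u0 = 2500.0                                    # threshold in grams (illustrative)
alpha0, alpha1, sigma = 3300.0, -250.0, 500.0  # hypothetical mean model and SD

results = {}
for x in (0, 1):
    u = alpha0 + alpha1 * x + sigma * rng.normal(size=200_000)  # latent weights
    y = (u < u0).astype(float)                                  # binary indicator
    p_sim = y.mean()                                            # estimates E[Y | x]
    p_theory = norm_cdf((u0 - alpha0 - alpha1 * x) / sigma)
    results[x] = (p_sim, p_theory)
    print(f"x = {x}: simulated {p_sim:.4f}, Pr(U < u0 | x) = {p_theory:.4f}")
```

The binary response carries less information than U itself, but its mean still depends on x through the distribution of the underlying continuous variable.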

The  above  gives  one  a  way  of  thinking  about  where  the  random  terms  in  models 
arise  from,  namely  as  unmeasured  covariates.  In  terms  of  distributional  assumptions, 
some  distributions  arise  naturally  as  a  consequence  of  simple  physical  models.  For 
example,  suppose  we  are  interested  in  modeling  the  number  of  events  occurring  over 
time.  The  process  we  now  describe  has  been  found  empirically  to  model  a  number  of 
phenomena,  for  example  the  arrival  of  calls  at  a  telephone  exchange  or  the  emission 
of particles from a radioactive source. Let the rate of occurrences be denoted by $\rho > 0$ and $N(t, t + \Delta t)$ be the number of events in the interval $(t, t + \Delta t]$. Suppose that, informally speaking, $\Delta t$ tends to zero from above and that

$$\Pr[N(t, t + \Delta t) = 0] = 1 - \rho \Delta t + o(\Delta t),$$

$$\Pr[N(t, t + \Delta t) = 1] = \rho \Delta t + o(\Delta t),$$

so that $\Pr[N(t, t + \Delta t) > 1] = o(\Delta t)$. The notation $o(\Delta t)$ represents a function that tends to zero more rapidly than $\Delta t$. Finally, suppose that $N(t, t + \Delta t)$ is independent of occurrences in $(0, t]$. Then we have a Poisson process, and the number of events occurring in the fixed interval $(t, t + h]$ is a Poisson random variable with mean $\rho h$.
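
The construction can be mimicked on a computer by dividing an interval of length h into many short subintervals, in each of which an event occurs with probability equal to the rate multiplied by the subinterval length, independently of the others. The Python sketch below uses hypothetical values for the rate and for h, and checks that the resulting counts have mean and variance close to the Poisson value; it is an approximation for a small but nonzero subinterval length, not an exact simulation of the limiting process.

```python
import numpy as np

rng = np.random.default_rng(2)
rate, h = 3.0, 2.0        # hypothetical rate and interval length
dt = 1e-3                 # length of each short subinterval
steps = int(round(h / dt))
reps = 20_000             # number of simulated realizations of the process

# The count over (t, t + h] is the number of subintervals containing an event;
# with independent subintervals this is a binomial(steps, rate * dt) draw.
counts = rng.binomial(n=steps, p=rate * dt, size=reps)

print("mean count:    ", counts.mean())
print("variance:      ", counts.var())
print("Poisson mean:  ", rate * h)
```

Equality of the mean and variance is a characteristic of the Poisson distribution, and both simulated quantities are close to the product of the rate and h.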

Other  distributions  are  “artificial.”  For  example,  a  number  of  distributions  arise 
as  functions  of  normal  random  variables  (such  as  Student’s  t,  Snedecor’s  F,  and 
chi-squared  random  variables)  or  may  be  dreamt  up  for  flexible  and  convenient 
modeling  (as  is  the  case  for  the  so-called  Pearson  family  of  distributions). 

Models  can  arise  from  idealized  views  of  the  phenomenon  under  study,  but  then 
we  might  ask:  “If  we  could  measure  absolutely  everything  we  wanted  to,  would 
there  be  any  randomness  left?”  In  all  but  the  simplest  experiments,  this  question  is 
probably  not  that  practically  interesting,  but  the  central  idea  of  quantum  mechanics 
tells  us  that  probability  is  still  needed,  because  some  experimental  outcomes  are 
fundamentally  unpredictable  (e.g.,  Feynman  1951). 


1.5  Bayesian  and  Frequentist  Inference 

What  distinguishes  the  field  of  statistics  from  the  use  of  statistical  techniques  in  a 
particular  discipline  is  a  principled  approach  to  inference  in  the  face  of  uncertainty. 
There  are  two  dominant  approaches  to  inference,  which  we  label  as  Bayesian  and 
frequentist,  and  each  produces  inferential  procedures  that  are  optimal  with  respect 
to  different  criteria. 

In  Chaps.  2  and  3,  we  describe,  respectively,  the  frequentist  and  Bayesian 
approaches  to  statistical  inference.  Central  to  the  philosophy  of  each  approach 
is  the  interpretation  of  probability  that  is  taken.  In  the  frequentist  approach,  as 
the  name  suggests,  probabilities  are  viewed  as  limiting  frequencies  under  infinite 
hypothetical  replications  of  the  situation  under  consideration.  Inferential  recipes, 
such  as  specific  estimators,  are  assessed  with  respect  to  their  performance  under 
repeated  sampling  of  the  data,  with  model  parameters  viewed  as  fixed,  albeit 
unknown,  constants.  By  contrast,  in  the  Bayesian  approach  that  is  described  in 
this  book,  probabilities  are  viewed  as  subjective  and  are  interpreted  conditional  on 
the  available  information.  As  a  consequence,  assigned  probabilities  concerning  the 
same  event  may  differ  between  individuals.  In  this  sense  probabilities  do  not  exist 
as  they  vary  as  a  function  of  the  available  information.  All  unknown  parameters  in 
a  model  are  treated  as  random  variables,  and  inference  is  based  upon  the  (posterior) 
probability  distribution  of  these  parameters,  given  the  data  and  other  available 
information.  Practically  speaking,  the  interpretation  of  probability  is  less  relevant 
than  the  number  of  assumptions  that  are  required  for  valid  inference  (which  has 
implications  for  the  robustness  of  analysis)  and  the  breadth  of  inferential  questions 
that  can  be  answered  using  a  particular  approach. 

It  should  be  stressed  that  many  issues  arising  in  the  analysis  of  regression 
data  (such  as  the  nature  of  the  sampling  scheme,  parameter  interpretation,  and 
misspecification  of  the  mean  model)  are  independent  of  philosophy  and  in  practice 
are  usually  of  far  greater  importance  than  the  inferential  approach  taken  to  analysis. 

Each  of  the  frequentist  and  Bayesian  approaches  has  its  merits  and  can  often 
be  used  in  tandem,  an  approach  we  follow  and  advocate  throughout  this  book.  If 
substantive  conclusions  differ  between  different  approaches,  then  discovering  the 
reasons  for  the  discrepancies  can  be  informative  as  it  may  reveal  that  a  particular 
analysis  is  leaning  on  inappropriate  assumptions  or  that  relevant  information  is 
being  ignored  by  one  of  the  approaches.  Those  situations  in  which  one  of  the 
approaches  is  more  or  less  suitable  will  also  be  distinguished  throughout  this  book, 
with  a  short  summary  being  given  in  the  next  section. 


1.6  The  Executive  Summary 

I  would  like  to  briefly  summarize  my  view  on  when  to  take  Bayesian  or  frequentist 
approaches  to  estimation.  As  the  examples  throughout  this  book  show,  on  many 
occasions,  if  one  is  careful  in  execution,  both  approaches  to  analysis  will  yield 
essentially  equivalent  inference.  For  small  samples,  the  Bayesian  approach  with 
thoughtfully  specified  priors  is  often  the  only  way  to  go  because  of  the  difficulty 
in  obtaining  well-calibrated  frequentist  intervals.  An  example  of  such  a  sparse 
data  occasion  is  given  at  the  end  of  Sect.  6.16.  For  medium  to  large  samples, 
unless  there  is  strong  prior  information  that  one  wishes  to  incorporate,  a  robust 
frequentist  approach  using  sandwich  estimation  (or  quasi-likelihood  if  one  has  faith 
in  the  variance  model)  is  very  appealing  since  consistency  is  guaranteed  under 
relatively  mild  conditions.  For  highly  complex  models  (e.g.,  with  many  random 
effects),  a  Bayesian  approach  is  often  the  most  convenient  way  to  formulate  the 
model,  and  computation  under  the  Bayesian  approach  is  the  most  straightforward. 
The  modeling  of  spatial  dependence  in  Sect.  9.7  provides  one  such  example  in 
which  the  Bayesian  approach  is  the  simplest  to  implement.  The  caveat  to  complex 
modeling  is  that  in  most  cases  consistency  of  inference  is  only  available  if  all 
stages  of  the  model  are  correctly  specified.  Consequently,  if  one  really  cares  about 
interval  estimates,  then  extensive  model  checking  will  be  necessary.  If  formal 
inference  is  not  required  but  rather  one  is  in  an  exploratory  phase,  then  there  is  far 
greater  freedom  to  experiment  with  the  approaches  that  one  is  most  familiar  with, 
including  nonparametric  regression.  In  this  setting,  using  procedures  that  are  less 
well-developed  statistically  is  less  dangerous. 

In  contrast  to  estimation,  hypothesis  testing  using  frequentist  and  Bayesian 
methods  can  often  produce  starkly  differing  results,  even  in  large  samples.  As 
discussed  in  Chap.  4,  I  think  that  hypothesis  testing  is  a  very  difficult  endeavor,  and 
tests  applied  using  the  frequentist  approach,  as  currently  practiced  (with  α  levels 
being  fixed  regardless  of  sample  size),  can  be  very  difficult  to  interpret.  In  general, 
I  prefer  estimation  to  hypothesis  testing. 

As  a  final  comment,  as  noted,  in  many  instances  carefully  conducted  frequentist 
and  Bayesian  approaches  will  lead  to  similar  substantive  conclusions;  hence,  the 
choice  between  these  approaches  can  often  be  based  on  that  which  is  most  natural 
(i.e.,  based  on  training  and  experience)  to  the  analyst.  Consequently,  throughout  this 
book,  methods  are  discussed  in  terms  of  their  advantages  and  shortcomings,  but  a 
strong  recommendation  of  one  method  over  another  is  usually  not  given  as  there  is 
often  no  reason  for  stating  a  preference. 


1.7  Bibliographic  Notes 

Rosenbaum  (2002)  provides  an  in-depth  discussion  of  the  analysis  of  data  from 
observational  studies,  and  an  in-depth  treatment  of  causality  is  the  subject  of 
Pearl  (2009).  A  classic  text  on  survey  sampling  is  Cochran  (1977)  with  Korn 
and  Graubard  (1999)  and  Lumley  (2010)  providing  more  recent  presentations. 
Regression  from  a  survey  sampling  viewpoint  is  discussed  in  the  edited  volume  of 
Chambers  and  Skinner  (2003).  Errors-in-variables  is  discussed  in  detail  by  Carroll 
et  al.  (2006)  and  missing  data  by  Little  and  Rubin  (2002).  Johnson  et  al.  (1994, 
1995,  1997);  Kotz  et  al.  (2000),  and  Johnson  et  al.  (2005)  provide  a  thorough 
discussion  of  the  genesis  of  univariate  and  multivariate  discrete  and  continuous 
probability  distributions  and,  in  particular,  their  relationships  to  naturally  occurring 
phenomena.  Barnett  (2009)  provides  a  discussion  of  the  mechanics  and  relative 
merits  of  Bayesian  and  frequentist  approaches  to  inference;  see  also  Cox  (2006). 


Inferential  Approaches 


Chapter  2 

Frequentist  Inference 


2.1  Introduction 

Inference  from  data  can  take  many  forms,  but  primary  inferential  aims  will  often  be 
point  estimation,  to  provide  a  “best  guess”  of  an  unknown  parameter,  and  interval 
estimation,  to  produce  ranges  for  unknown  parameters  that  are  supported  by  the 
data.  Under  the  frequentist  approach,  parameters  and  hypotheses  are  viewed  as 
unknown  but  fixed  (nonrandom)  quantities,  and  consequently  there  is  no  possibility 
of  making  probability  statements  about  these  unknowns.1  As  the  name  suggests, 
the  frequentist  approach  is  characterized  by  a  frequency  view  of  probability,  and  the 
behavior  of  inferential  procedures  is  evaluated  under  hypothetical  repeated  sampling 
of  the  data. 

Frequentist  procedures  are  not  typically  universally  applicable  to  all  models/ 
sample  sizes  and  often  require  “fixes.”  For  example,  a  number  of  variants  of 
likelihood  have  been  developed  for  use  in  particular  situations  (Sect.  2.4.2).  In 
contrast,  the  Bayesian  approach,  described  in  Chap.  3,  is  completely  prescriptive, 
though  there  are  significant  practical  hurdles  to  overcome  (such  as  likelihood 
and  prior  specification)  in  pursuing  that  prescription.  In  addition,  in  situations  in 
which  frequentist  procedures  encounter  difficulties,  Bayesian  approaches  typically 
require  very  careful  prior  specification  to  avoid  posterior  distributions  that  exhibit 
anomalous  behavior. 

The outline of this chapter is as follows. We begin our discussion in Sect. 2.2 with an overview of criteria by which frequentist procedures may be evaluated. In Sect. 2.3 we present a general development of estimating functions which provide a unifying framework for defining and establishing the properties of commonly used frequentist procedures. Two important classes of estimating functions are then introduced: those arising from the specification of a likelihood function, in Sect. 2.4, and those from a quasi-likelihood function, in Sect. 2.5. A recurring theme is the assessment of frequentist procedures under model misspecification. In Sect. 2.6 we discuss the sandwich estimation technique which provides estimation of the standard error of estimators in more general circumstances than were assumed in deriving the estimator. Section 2.7 introduces the bootstrap, which is a simulation-based method for making inference with reduced assumptions. Section 2.8 discusses the choice of an estimating function. Hypothesis testing is considered in Sect. 2.9, and the chapter ends with concluding remarks in Sect. 2.10. To provide some numerical relief to the mostly methodological development of this chapter, we provide one running example.

1 Random effects models provide one example in which parameters are viewed as random from a frequentist perspective and are regarded as arising from a population of such effects. Frequentist inference for such models is described in Part III of this book.

Fig. 2.1 Exploratory plot of log SMR for lung cancer versus average residential radon (pCi/liter), with a local smoother superimposed, for 85 counties in Minnesota


Example:  Lung  Cancer  and  Radon 

We consider the data introduced in Sect. 1.3.3 and examine the association between counts of lung cancer incidence, $Y_i$, and the average residential radon, $x_i$, in county $i$, with $i = 1, \ldots, 85$ indexing the counties within which radon measurements were available (in two counties no radon data were reported). We examine the association using the loglinear model

$$\log E[\text{SMR}_i \mid x_i] = \beta_0 + \beta_1 x_i, \qquad (2.1)$$

where $\text{SMR}_i = Y_i / E_i$ (with $E_i$ the expected count) is the standardized mortality ratio in county $i$ (Sect. 1.3.3) and is a summary measure that controls for the differing age and gender populations across counties. We take as our parameter of interest $\exp(\beta_1)$, which is the multiplicative change in risk associated with a 1 pCi/l increase in radon. In the epidemiological literature this parameter is referred to as the relative risk: here it corresponds to the risk ratio for two areas whose radon exposures $x$ differ by one unit.

To first order, $E[\log \text{SMR} \mid x] \approx \log E[\text{SMR} \mid x]$, and so if (2.1) is an appropriate model, a plot of $\log \text{SMR}_i$ versus $x_i$ should display an approximately linear trend; Fig. 2.1 shows this plot with a local smoother superimposed and indicates a negative association. This example is illustrative, and so distracting issues, such as the effect of additional covariates (including smoking, the major confounder) and residual spatial dependence in the counts, will be conveniently ignored.
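
To fix ideas about model (2.1) and the interpretation of the relative risk, the sketch below simulates counts for 85 hypothetical areas and estimates the two coefficients. It is not an analysis of the Minnesota data: the radon values, expected counts, and true coefficients are all synthetic, and the use of a Poisson likelihood fitted by Newton-Raphson is an assumption made here for illustration (the estimation methods themselves are developed later in the chapter).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 85
x = rng.uniform(1.0, 10.0, size=n)        # synthetic radon levels
E = rng.uniform(20.0, 500.0, size=n)      # synthetic expected counts
beta_true = np.array([0.2, -0.04])        # hypothetical coefficients

X = np.column_stack([np.ones(n), x])
Y = rng.poisson(E * np.exp(X @ beta_true))   # E[Y_i | x_i] = E_i exp(b0 + b1 x_i)

# Newton-Raphson for the Poisson log-likelihood, with log(E_i) as an offset.
beta = np.zeros(2)
for _ in range(50):
    mu = E * np.exp(X @ beta)
    score = X.T @ (Y - mu)
    info = X.T @ (mu[:, None] * X)
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

rel_risk = np.exp(beta[1])
print("estimated coefficients:", beta)
print("estimated relative risk per unit increase in x:", rel_risk)
```

A relative risk below one corresponds to a negative coefficient on x, that is, lower risk in areas with higher values of the covariate.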


2.2  Frequentist  Criteria 

In this section we describe frequentist criteria by which competing estimators may be compared and discuss conditions under which optimal estimators exist under these criteria. Under the frequentist approach to inference, the fundamental outlook is that statistical procedures are assessed with respect to their performance under hypothetical, repeated sampling of the data, under fixed values of the parameters. In this section, for simplicity, we consider the estimation of a univariate parameter $\theta$ and let $Y = [Y_1, \ldots, Y_n]^T$ represent a vector of $n$ random variables and $y = [y_1, \ldots, y_n]^T$ a realization. Often inference will be summarized via a $100(1 - \alpha)\%$ confidence interval for $\theta$, which is an interval $[a(Y), b(Y)]$ such that

$$\Pr\{\theta \in [a(Y), b(Y)]\} = 1 - \alpha, \qquad (2.2)$$

for all $\theta$, where the probability statement is with respect to the distribution of $Y$ and $1 - \alpha$ is known as the coverage probability. For interpretation it is crucial to recognize that the random quantities in (2.2) are the endpoints of the interval $[a(Y), b(Y)]$, so that we are not assigning a probability statement to $\theta$. The correct interpretation of a confidence interval is that, under hypothetical repeated sampling, a proportion $1 - \alpha$ of the intervals created will contain the true value $\theta$. We emphasize that we cannot say that the specific interval $[a(y), b(y)]$ contains $\theta$ with probability $1 - \alpha$.
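
The repeated sampling interpretation of (2.2) can be illustrated by simulation. In the Python sketch below, the data are assumed to be normal with known standard deviation, so that the usual interval based on the sample mean has exact coverage; the parameter values, sample size, and number of replications are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n = 5.0, 2.0, 25     # hypothetical truth; sigma treated as known
z = 1.959964                       # 0.975 quantile of N(0, 1), so alpha = 0.05
reps = 20_000                      # hypothetical repeated samples

samples = rng.normal(theta, sigma, size=(reps, n))
ybar = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)
lower, upper = ybar - half_width, ybar + half_width   # a(Y) and b(Y)

# theta is fixed; it is the endpoints that vary from sample to sample.
covered = (lower <= theta) & (theta <= upper)
coverage = covered.mean()
print("proportion of intervals containing theta:", coverage)
```

Any single row of `covered` is simply true or false, which mirrors the point that a specific realized interval either contains the parameter or does not.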

Ideally, we would like to determine the shortest possible confidence interval for a given $\alpha$. The search for such intervals is closely linked to the determination of optimal point estimators of $\theta$. The point estimator $\hat{\theta}(Y)$ of $\theta$ represents a random variable, with an associated sampling distribution, while the point estimate $\hat{\theta}(y)$ is a specific value. In any given situation a host of potential estimators are available, and we require criteria by which to judge competing choices. Heuristically speaking, a good estimator will have a sampling distribution that is concentrated "close" to the true value $\theta$, where "close" depends on the distance measure that we apply to the distribution of $\hat{\theta}(Y)$.

One natural measure of closeness is the mean squared error (MSE) of $\hat{\theta}(Y)$, which arises from a quadratic loss function for estimation and is defined as

$$\text{MSE}[\hat{\theta}(Y)] = E_{Y \mid \theta}\{[\hat{\theta}(Y) - \theta]^2\} = \text{var}_{Y \mid \theta}[\hat{\theta}(Y)] + \text{bias}[\hat{\theta}(Y)]^2,$$

where the bias of the estimator is

$$\text{bias}[\hat{\theta}(Y)] = E_{Y \mid \theta}[\hat{\theta}(Y)] - \theta.$$

This notation stresses that all expectations are with respect to the sampling distribution of the estimator, given the true value of the parameter; this is a crucial aspect but the notation is cumbersome and so will be suppressed. Finding estimators with minimum MSE for all values of $\theta$ is not possible. For example, $\hat{\theta}(Y) = 3$ has zero MSE for $\theta = 3$ (and so is optimal for this $\theta$!) but is, in general, a disastrous estimator.
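
The decomposition of the MSE into variance plus squared bias, and the behavior of the constant estimator, can be seen in a small simulation. The Python sketch below assumes normal data with unit variance and compares the sample mean with the estimator that always returns 3, at two hypothetical values of the parameter; the helper `mse_parts` and all numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 10, 50_000


def mse_parts(estimates, theta):
    """Monte Carlo MSE, variance, and bias of an estimator at a given theta."""
    bias = estimates.mean() - theta
    var = estimates.var()
    mse = np.mean((estimates - theta) ** 2)
    return mse, var, bias


results = {}
for theta in (3.0, 0.0):
    y = rng.normal(theta, 1.0, size=(reps, n))
    estimators = {
        "sample mean": y.mean(axis=1),
        "constant 3": np.full(reps, 3.0),
    }
    for name, est in estimators.items():
        mse, var, bias = mse_parts(est, theta)
        results[(theta, name)] = (mse, var, bias)
        print(f"theta = {theta}, {name}: MSE = {mse:.4f}, "
              f"var + bias^2 = {var + bias ** 2:.4f}")
```

The sample mean has an MSE close to 1/n at both parameter values, whereas the constant estimator has zero MSE at 3 and an MSE of 9 at 0, so neither estimator dominates the other for all values of the parameter.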

An elegant theory, which is briefly summarized in Appendix G, has been developed to characterize uniformly minimum-variance unbiased estimators (UMVUEs). The theory depends first on writing down a full probability model for the data, $p(y \mid \theta)$. We assume conditional independence so that $p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$. The Cramér-Rao lower bound for any unbiased estimator $\hat{\phi}$ of a scalar function of interest $\phi = \phi(\theta)$ is

$$\text{var}(\hat{\phi}) \geq -\frac{[\phi'(\theta)]^2}{E\left[\dfrac{\partial^2 l}{\partial \theta^2}\right]}, \qquad (2.3)$$

where $l(\theta) = \sum_{i=1}^{n} \log p(y_i \mid \theta)$ is the log of the joint distribution of the data, viewed as a function of $\theta$. If $T(Y)$ is a sufficient statistic of dimension 1, then, under suitable regularity conditions, there is a unique function $\phi(\theta)$ for which a UMVUE exists and its variance attains the Cramér-Rao lower bound. Further, a UMVUE whose variance attains this bound exists only when the data are independently sampled from a one-parameter exponential family. Specifically, suppose that $p(y_i \mid \theta)$ is of one-parameter exponential family form, so that its distribution may be written, for suitably defined functions, as

$$p(y \mid \theta) = \exp[\theta T(y) - b(\theta) + c(y)]. \qquad (2.4)$$

In this situation, there is a unique function of $\theta$ for which a UMVUE exists. Unfortunately, this theory only covers a narrow range of circumstances. There are methods available for constructing estimators with the minimal attainable variance in additional situations but even this wider class of models does not come close to covering the range of models that we would like to consider for practical application. UMVUEs are also not always sensible; see Exercise 2.2.

As discussed in Sect. 1.2, model formulation should begin with a model that we
would  like  to  fit,  before  proceeding  to  examine  its  mathematical  properties.  As  we 
will  see,  exponential  family  models  can  provide  robust  inference,  in  the  sense  of 
performing  well  even  if  certain  aspects  of  the  assumed  model  are  wrong,  but  to  only 
consider  these  models  is  unnecessarily  restrictive. 

We now discuss how estimators may be compared in general circumstances
asymptotically, that is, as $n \to \infty$. There are two hypothetical situations that are
being considered here. The first is the repeated sampling aspect for fixed $n$, and the
second is allowing $n \to \infty$. The asymptotic properties of frequentist procedures
may be used in two respects. The first is to justify particular procedures, and the
second is to carry out inference, for example, to construct confidence intervals. We
might question the relevance of asymptotic criteria, since in any practical situation $n$
is finite, and an inconsistent or asymptotically inefficient estimator may have better
finite sample properties (a reduced MSE for instance) than a consistent alternative.
On the other hand, for many commonly used models, asymptotic inference is often
accurate for relatively small sample sizes (as we will see in later chapters).

While unbiasedness of estimators, per se, is of debatable value, a fundamentally
important frequentist criterion for assessing an estimator is consistency. Weak
consistency states that as $n \to \infty$, $\hat\theta_n \to_p \theta$ (Appendix F), that is,

$$\Pr(|\hat\theta_n - \theta| > \epsilon) \to 0 \quad \text{as } n \to \infty \text{ for any } \epsilon > 0.$$

Intuitively, the distribution of a consistent estimator concentrates more and more
around the true value as the sample size increases. In all but pathological cases,
a consistent estimator is asymptotically unbiased, though the converse is not true.
For example, consider the model with $\mathrm{E}[Y_i \mid \theta] = \theta$, $i = 1, \ldots, n$, and the estimator
$\hat\theta = Y_1$; this estimator is unbiased but inconsistent, because its sampling
distribution does not change as $n$ increases.
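
The distinction between unbiasedness and consistency can be seen in a small simulation. The sketch below is an illustration under assumed conditions (normal data with unit variance, a tolerance of 0.5, and 1,000 replications, none of which come from the text): it estimates the probability in the definition of weak consistency for the first observation and for the sample mean.

```python
import random


def exceed_prob(estimator, theta, n, eps=0.5, reps=1000, seed=1):
    """Monte Carlo estimate of Pr(|estimator - theta| > eps) for N(theta, 1) data."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        y = [rng.gauss(theta, 1.0) for _ in range(n)]
        if abs(estimator(y) - theta) > eps:
            exceed += 1
    return exceed / reps


def first_observation(y):
    """Unbiased but inconsistent: ignores all but the first data point."""
    return y[0]


def sample_mean(y):
    """Unbiased and consistent."""
    return sum(y) / len(y)


for n in (5, 200):
    print(n, exceed_prob(first_observation, 0.0, n), exceed_prob(sample_mean, 0.0, n))
```

For the first observation the exceedance probability stays at roughly the same value whatever the sample size, whereas for the sample mean it shrinks toward zero.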

When assessing an estimator, once consistency has been established, asymptotic
normality of the estimator is then typically sought, and interest focuses on the
variance of the estimator. In particular, the asymptotic relative efficiency, or more
simply the efficiency, allows an estimator $\tilde\theta_n$ to be compared to the estimator with
the smallest variance $\hat\theta_n$ via

$$\frac{\mathrm{var}(\hat\theta_n)}{\mathrm{var}(\tilde\theta_n)}.$$

The $100(1 - \alpha)\%$ asymptotic confidence interval associated with an estimator $\tilde\theta_n$ is

$$\tilde\theta_n \pm z_{1-\alpha/2} \times \sqrt{\mathrm{var}(\tilde\theta_n)}, \tag{2.5}$$

where $Z \sim N(0, 1)$ and $\Pr(Z < z_{1-\alpha/2}) = 1 - \alpha/2$. If $\tilde\theta_n$ is asymptotically
efficient, then interval (2.5) is (asymptotically) the shortest available. Maximum
likelihood estimation (Sect. 2.4) provides a method for finding efficient estimators.
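
Interval (2.5) is simple to compute once an estimate and its estimated variance are available. The following is a minimal sketch using only the Python standard library; the function name `wald_interval` and its arguments are illustrative choices, not notation from the text.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(estimate, variance, alpha=0.05):
    """Asymptotic 100(1 - alpha)% interval: estimate +/- z_{1 - alpha/2} * sqrt(variance)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * sqrt(variance)
    return estimate - half_width, estimate + half_width


# Example: an estimate of 0 with estimated variance 1 gives roughly (-1.96, 1.96).
print(wald_interval(0.0, 1.0))
```

Here `variance` is the estimated variance of the estimator itself, so any division by the sample size must already have been carried out.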

A  difficulty  with  the  interpretation  of  frequentist  inferential  summaries  is  that  all 
probability  statements  refer  to  hypothetical  data  replications  and  to  the  estimator, 
and  not  to  the  estimate  from  a  specific  realization  of  data.  This  can  lead  to  intervals 
with  poor  properties.  Exercise  2.1  describes  an  instance  in  which  the  confidence 
coverage  is  correct  on  average,  but  for  some  realizations  of  the  data,  the  interval  has 
100%  coverage. 

We summarize this section and provide a road map to the remainder of the
chapter. A fundamental, desirable criterion is to produce confidence intervals that
are the shortest possible. Only in stylized situations may estimators with minimum
variance be found in non-asymptotic situations. Asymptotically, the picture is rosier,
however. In the next section we describe a general class of estimators and give
results concerning consistency and asymptotic normality. Subsequently, we show
that maximum likelihood estimators attain the smallest asymptotic variance (subject
to regularity conditions) if the model is correctly specified. We then consider
quasi-likelihood, sandwich estimation, and the bootstrap, each of which is designed to
reduce the reliance of inference on a full probability model specification.


2.3  Estimating  Functions 

In  the  last  section  we  saw  that  optimal  estimators  can  be  found  when  a  full 
probability  model  is  assumed.  The  need  to  specify  a  full  probability  model  for 
the  data  is  undesirable.  While  a  practical  context  may  suggest  a  mean  model  and 
perhaps  an  appropriate  mean-variance  relationship,  it  is  rare  to  have  faith  in  a  choice 
for  the  distribution  of  the  data.  In  this  section  we  give  a  framework  within  which  the 
asymptotic  properties  of  a  broad  range  of  estimation  recipes  may  be  evaluated. 

Let $Y = [Y_1, \ldots, Y_n]$ represent $n$ observations from a distribution indexed by a
$p$-dimensional parameter $\theta$, with $\mathrm{cov}(Y_i, Y_j \mid \theta) = 0$, $i \ne j$. In the following we
will  not  rigorously  derive  asymptotic  results  and  only  informally  discuss  regularity 
conditions  under  which  the  results  hold.  The  models  discussed  subsequently  will, 
unless  otherwise  stated,  obey  the  necessary  conditions. 

In the following, for ease of presentation, we assume that $Y_i$, $i = 1, \ldots, n$, are
independent and identically distributed (iid).$^2$ An estimating function is a function,

$$G_n(\theta) = \frac{1}{n}\sum_{i=1}^n G(\theta, Y_i), \tag{2.6}$$

of the same dimension as $\theta$ for which

$$\mathrm{E}[G_n(\theta)] = 0 \tag{2.7}$$

for all $\theta$. The estimating function $G_n(\theta)$ is a random variable because it is a function
of $Y$. The corresponding estimating equation that defines the estimator $\hat\theta_n$ has
the form

$$G_n(\hat\theta_n) = \frac{1}{n}\sum_{i=1}^n G(\hat\theta_n, Y_i) = 0. \tag{2.8}$$

$^2$In a regression setting we have independently distributed observations only, because the
distribution of the outcome changes as a function of covariates.

For inference the asymptotic properties of the estimating function are derived
(which is why we index the estimating function by $n$), and these are transferred
to the resultant estimator. The estimator $\hat\theta_n$ that solves (2.8) will often be unavailable
in closed form and so deriving its distribution from that of the estimating function
is an ingenious step, because the estimating function may be constructed to be a
simple (e.g., linear) function of the data. The estimating function defined in (2.6) is a
sum of random variables, which provides the opportunity to evaluate its asymptotic
properties via a central limit theorem since the first two moments will often be
straightforward to calculate. The art of constructing estimating functions is to make
them dependent on distribution-free quantities, for example, the first two moments
of the data; robustness of inference to misspecification of higher moments often
follows.
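
To make the recipe concrete, the sketch below solves a univariate estimating equation numerically by bisection. The choice $G(\theta, y) = y - \exp(\theta)$, the function names, and the toy data are assumptions made for illustration only; for this particular $G$ the root is the log of the sample mean, which provides a check on the numerical solution.

```python
from math import exp, log


def G_n(theta, y, G):
    """Estimating function: the average of G(theta, y_i) over the sample."""
    return sum(G(theta, yi) for yi in y) / len(y)


def solve_estimating_equation(y, G, lo, hi, tol=1e-10):
    """Bisection for G_n(theta) = 0; assumes G_n changes sign on [lo, hi]."""
    g_lo = G_n(lo, y, G)
    if g_lo * G_n(hi, y, G) > 0:
        raise ValueError("G_n does not change sign on the supplied interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = G_n(mid, y, G)
        if g_lo * g_mid <= 0:
            hi = mid                 # root lies in [lo, mid]
        else:
            lo, g_lo = mid, g_mid    # root lies in [mid, hi]
    return 0.5 * (lo + hi)


def G_log_mean(theta, yi):
    """Illustrative G: unbiased for zero when E[Y] = exp(theta)."""
    return yi - exp(theta)


y = [2.0, 3.0, 7.0]
theta_hat = solve_estimating_equation(y, G_log_mean, -5.0, 5.0)
print(theta_hat, log(sum(y) / len(y)))
```

Only the first moment of the data enters this $G$, so the resulting estimator does not rely on any further distributional assumption.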

We  now  state  an  important  result  that  will  be  used  repeatedly  in  the  context  of 
frequentist  inference. 

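Result 2.1. Suppose that $\hat\theta_n$ is a solution to the estimating equation

$$G_n(\theta) = \frac{1}{n}\sum_{i=1}^n G(\theta, Y_i) = 0,$$

that is, $G_n(\hat\theta_n) = 0$. Then $\hat\theta_n \to_p \theta$ (consistency) and

$$\sqrt{n}\,(\hat\theta_n - \theta) \to_d N_p\left[0, A^{-1}B(A^T)^{-1}\right] \tag{2.9}$$

(asymptotic normality), where

$$A = A(\theta) = \mathrm{E}\left[\frac{\partial}{\partial\theta^T} G(\theta, Y)\right]$$

and

$$B = B(\theta) = \mathrm{E}[G(\theta, Y)G(\theta, Y)^T] = \mathrm{var}[G(\theta, Y)].$$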


Outline  Derivation 


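We refer the interested reader to van der Vaart (1998, Sect. 5.2) for a proof of
consistency and present an outline derivation of asymptotic normality, based on
van der Vaart (1998, Sect. 5.3). For simplicity we assume that $\theta$ is univariate.

We expand $G_n(\hat\theta_n)$ in a Taylor series around the true value $\theta$:

$$0 = G_n(\hat\theta_n) = G_n(\theta) + (\hat\theta_n - \theta)\left.\frac{dG_n}{d\theta}\right|_{\theta} + \frac{1}{2}(\hat\theta_n - \theta)^2\left.\frac{d^2G_n}{d\theta^2}\right|_{\tilde\theta}, \tag{2.10}$$

where $\tilde\theta$ is a point between $\hat\theta_n$ and $\theta$. We rewrite (2.10) as

$$\sqrt{n}\,(\hat\theta_n - \theta) = \frac{-\sqrt{n}\,G_n(\theta)}{\left.\dfrac{dG_n}{d\theta}\right|_{\theta} + \dfrac{1}{2}(\hat\theta_n - \theta)\left.\dfrac{d^2G_n}{d\theta^2}\right|_{\tilde\theta}} \tag{2.11}$$

and determine the asymptotic distribution of the right-hand side, beginning with the
distribution of $G_n(\theta)$. To apply a central limit theorem, note that $\mathrm{E}[G_n(\theta)] = 0$ and

$$n \times \mathrm{var}[G_n(\theta)] = \mathrm{var}[G(\theta, Y)] = \mathrm{E}[G(\theta, Y)^2] = B$$

(which we assume is finite). Consequently, by the central limit theorem
(Appendix G),

$$\sqrt{n}\,G_n(\theta) \to_d N[0, B(\theta)]. \tag{2.12}$$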

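We now transfer the properties of the estimating function to the estimator $\hat\theta_n$ via
(2.11). The first term of the denominator of (2.11),

$$\left.\frac{dG_n}{d\theta}\right|_{\theta} = \frac{1}{n}\sum_{i=1}^n \frac{d}{d\theta}G(\theta, Y_i),$$

is an average and so converges to its expectation, provided this expectation exists,
by the weak law of large numbers (Appendix G):

$$\left.\frac{dG_n}{d\theta}\right|_{\theta} \to_p \mathrm{E}\left[\frac{d}{d\theta}G(\theta, Y)\right] = A(\theta).$$

Due to consistency, $\hat\theta_n \to_p \theta$, and the second term in the denominator of (2.11)
includes the average

$$\left.\frac{d^2G_n}{d\theta^2}\right|_{\tilde\theta} = \frac{1}{n}\sum_{i=1}^n \frac{d^2}{d\theta^2}G(\tilde\theta, Y_i),$$

which, by the law of large numbers, tends to its expectation, that is,

$$\left.\frac{d^2G_n}{d\theta^2}\right|_{\tilde\theta} \to_p \mathrm{E}\left[\frac{d^2}{d\theta^2}G(\theta, Y)\right],$$

provided this expectation exists. Hence, the second term in the denominator of (2.11)
converges in probability to zero and so, by Slutsky's theorem (Appendix G),

$$\sqrt{n}\,(\hat\theta_n - \theta) \to_d N\left[0, \frac{B}{A^2}\right],$$

as required, where we have suppressed the dependence of $A(\theta)$ and $B(\theta)$ on $\theta$.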

□ 

In practice, $A = A(\theta)$ and $B = B(\theta)$ are replaced by $A_n(\hat\theta_n)$ and $B_n(\hat\theta_n)$,
respectively, with asymptotic normality continuing to hold due to Slutsky's theorem.
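
The plug-in step can be sketched for a univariate parameter as follows. This is an illustration under stated assumptions rather than a general implementation: the function names are hypothetical, the derivative of $G$ is supplied analytically, and $A$ and $B$ are estimated by sample averages evaluated at the estimate.

```python
def sandwich_variance(y, theta_hat, G, dG):
    """Estimated variance of theta_hat from the sandwich form B / (A^2 n).

    A is estimated by the average derivative of G and B by the average of G^2,
    both evaluated at theta_hat (univariate parameter only).
    """
    n = len(y)
    A_hat = sum(dG(theta_hat, yi) for yi in y) / n
    B_hat = sum(G(theta_hat, yi) ** 2 for yi in y) / n
    return B_hat / (A_hat ** 2) / n


def G_mean(theta, yi):
    """Illustrative G for a mean: unbiased for zero when E[Y] = theta."""
    return yi - theta


def dG_mean(theta, yi):
    """Derivative of G_mean with respect to theta."""
    return -1.0


y = [1.0, 2.0, 3.0, 4.0]
theta_hat = sum(y) / len(y)  # solves the estimating equation for G_mean
print(sandwich_variance(y, theta_hat, G_mean, dG_mean))
```

For this choice of $G$ the result reduces to the average squared deviation from the sample mean divided by the sample size, with no distributional assumption beyond the existence of the second moment.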

In the sections that follow we describe a number of approaches for constructing
and using estimating functions. These approaches differ in the number of assumptions
that are required for both specifying the estimating function and making
inference. At one extreme, in a fully model-based approach, a full probability
distribution is specified for the data and is used to both specify the estimating
function and to evaluate the expectations required in the calculation of $A$ and $B$.
At the other extreme, minimal assumptions are made on the data to construct the
estimating function, and the expectations required to evaluate $\mathrm{var}(\hat\theta_n)$ are calculated
empirically from the observed data (see Sect. 2.6).

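In the independent but not identically distributed case

$$\left[A_n^{-1}B_n(A_n^T)^{-1}\right]^{-1/2}(\hat\theta_n - \theta) \to_d N_p(0, I_p), \tag{2.13}$$

where

$$A_n = \mathrm{E}\left[\frac{\partial}{\partial\theta^T}G_n(\theta)\right], \qquad B_n = \mathrm{E}\left[G_n(\theta)G_n(\theta)^T\right] = \mathrm{var}[G_n(\theta)].$$

The previous independent and identically distributed situation is a special case, with
$A_n = nA$ and $B_n = nB$ (taking $G_n(\theta)$ here to be the unnormalized sum
$\sum_{i=1}^n G(\theta, Y_i)$), in which case (2.13) simplifies to (2.9).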

The sandwich form of the variance of $\hat\theta_n$ in (2.9) and (2.13) (the covariance of
the estimating function, flanked by the expectation of the inverse of the Jacobian
matrix of the transformation from the estimating function to the parameter) is one
that will appear repeatedly.

Estimators derived from an estimating function are invariant in the sense that
if we are interested in a function, $\phi = g(\theta)$, then the estimator is $\hat\phi_n = g(\hat\theta_n)$.
The delta method (Appendix G) allows the transfer of inference from the parameters
of the model to quantities of interest. Specifically, suppose

$$\sqrt{n}\,(\hat\theta_n - \theta) \to_d N_p[0, V(\theta)].$$

Then, by the delta method,

$$\sqrt{n}\,[g(\hat\theta_n) - g(\theta)] \to_d N\left[0, g'(\theta)V(\theta)g'(\theta)^T\right],$$

where $g'(\theta)$ is the $1 \times p$ vector of derivatives of $g(\cdot)$ with respect to elements of $\theta$.
For example, for $p = 2$

$$\mathrm{var}[g(\hat\theta)] = V_{11}\left(\frac{\partial g}{\partial\theta_1}\right)^2 + 2V_{12}\left(\frac{\partial g}{\partial\theta_1}\right)\left(\frac{\partial g}{\partial\theta_2}\right) + V_{22}\left(\frac{\partial g}{\partial\theta_2}\right)^2,$$

where $V_{jk}$ denotes the $(j,k)$th element of $V$, $j, k = 1, 2$. Again in practice, $\hat\theta_n$
replaces $\theta$ in $\mathrm{var}[g(\hat\theta)]$. The accuracy of the asymptotic distribution depends on the
parameterization adopted. A rule of thumb is to obtain the asymptotic distribution
for a reparameterized parameter defined on the real line; one may then transform
back to the parameter of interest, to construct confidence intervals, for example.
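
The quadratic form in the delta method is straightforward to evaluate. The sketch below computes $g'(\theta)V(\theta)g'(\theta)^T$ for a gradient and variance matrix held as plain Python lists; the ratio $g(\theta) = \theta_1/\theta_2$ and the numerical values are assumptions chosen only to illustrate the $p = 2$ formula.

```python
def delta_variance(grad, V):
    """Quadratic form g' V g'^T for a length-p gradient and a p x p matrix V."""
    p = len(grad)
    return sum(grad[j] * V[j][k] * grad[k] for j in range(p) for k in range(p))


# Illustrative case: g(theta) = theta_1 / theta_2 evaluated at theta = (2, 4).
theta1, theta2 = 2.0, 4.0
grad = [1.0 / theta2, -theta1 / theta2 ** 2]   # [dg/dtheta_1, dg/dtheta_2]
V = [[1.0, 0.5],
     [0.5, 2.0]]
print(delta_variance(grad, V))
```

For $p = 2$ the double sum expands to exactly the three terms of the displayed formula, with the cross term counted twice.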

The implementation of a frequentist approach usually requires a maximization or
root-finding  algorithm,  but  most  statistical  software  packages  now  contain  reliable 
routines  for  such  endeavors  in  the  majority  of  situations  encountered  in  practice; 
hence,  we  will  rarely  discuss  computational  details  (in  contrast  to  the  Bayesian 
approach  for  which  computation  is  typically  more  challenging). 


2.4  Likelihood 

For  reasons  that  will  become  evident,  likelihood  provides  a  popular  approach  to 
statistical inference and our coverage reflects this. Let $p(y \mid \theta)$ be a full probability
model for the observed data given a $p$-dimensional vector of parameters, $\theta$.
The  probability  model  for  the  full  data  is  based  upon  the  context  and  all  relevant 
accumulated  knowledge.  The  level  of  belief  in  this  model  will  clearly  be  context 
specific,  and  in  many  situations,  there  will  be  insufficient  information  available 
to  confidently  specify  all  components  of  the  model.  Depending  on  the  confidence 
in  the  likelihood,  which  in  turn  depends  on  the  sample  size  (since  large  n  allows 
more  reliable  examination  of  the  assumptions  of  the  model),  the  likelihood  may  be 
effectively  viewed  as  approximately  “correct,”  in  which  case  inference  proceeds  as 
if  the  true  model  were  known.  Alternatively  the  likelihood  may  be  seen  as  an  initial 
working  model  from  which  an  estimating  function  is  derived;  the  properties  of  the 
subsequent  estimator  may  then  be  determined  under  a  more  general  model. 

Definition. Viewing $p(y \mid \theta)$ as a function of $\theta$ gives the likelihood function,
denoted $L(\theta)$.

A key point is that $L(\theta)$ is not a probability distribution in $\theta$, hence the name
likelihood.$^3$


2.4.1  Maximum  Likelihood  Estimation 

The value of $\theta$ that maximizes $L(\theta)$ and hence gives the highest probability
(density) to the observed data, denoted $\hat\theta$, is known as the maximum likelihood
estimator (MLE).

In Part II of this book, we consider models that are appropriate when the data are
conditionally independent given $\theta$ so that

$$p(y \mid \theta) = \prod_{i=1}^n p(y_i \mid \theta).$$

$^3$We use the label "likelihood" in this section, but strictly speaking we are considering frequentist
likelihood, since we will evaluate the frequentist properties of an estimator derived from the
likelihood. This contrasts with a pure likelihood view, as described in Royall (1997), in which properties
are derived from the likelihood function alone, without resorting to frequentist arguments.

For the remainder of this chapter, we assume such conditional independence holds.
For both computation and analysis, it is convenient to consider the log-likelihood
function

$$l(\theta) = \log L(\theta) = \sum_{i=1}^n \log p(Y_i \mid \theta)$$


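and the score function

$$S(\theta) = \frac{\partial l(\theta)}{\partial\theta} = \left[\frac{\partial l(\theta)}{\partial\theta_1}, \ldots, \frac{\partial l(\theta)}{\partial\theta_p}\right]^T,$$

which is the $p \times 1$ vector of derivatives of the log-likelihood. As we now illustrate,
the score satisfies the requirements of an estimating function.

Definition. Fisher's expected information in a sample of size $n$ is the $p \times p$ matrix

$$I_n(\theta) = -\mathrm{E}\left[\frac{\partial^2 l(\theta)}{\partial\theta\,\partial\theta^T}\right] = -\mathrm{E}\left[\frac{\partial S(\theta)}{\partial\theta^T}\right].$$

Result. Under suitable regularity conditions,

$$\mathrm{E}[S(\theta)] = \mathrm{E}\left[\frac{\partial l}{\partial\theta}\right] = 0, \tag{2.14}$$

and

$$I_n(\theta) = -\mathrm{E}\left[\frac{\partial S(\theta)}{\partial\theta^T}\right] = \mathrm{E}\left[S(\theta)S(\theta)^T\right]. \tag{2.15}$$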


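Proof. For simplicity we give a proof for the situation in which $\theta$ is univariate,
and the observations are independent and identically distributed. Under these
circumstances

$$I_n(\theta) = nI_1(\theta),$$

where

$$I_1(\theta) = -\mathrm{E}\left[\frac{d^2}{d\theta^2}\log p(Y \mid \theta)\right].$$

The expectation of the score is

$$\mathrm{E}[S(\theta)] = \sum_{i=1}^n \mathrm{E}\left[\frac{d}{d\theta}\log p(Y_i \mid \theta)\right] = n\,\mathrm{E}\left[\frac{d}{d\theta}\log p(Y \mid \theta)\right]$$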


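and, under regularity conditions that allow the interchange of differentiation and
integration,

$$\begin{aligned}
\mathrm{E}\left[\frac{d}{d\theta}\log p(Y \mid \theta)\right] &= \int \left(\frac{d}{d\theta}\log p(y \mid \theta)\right) p(y \mid \theta)\,dy \\
&= \int \frac{\frac{d}{d\theta}p(y \mid \theta)}{p(y \mid \theta)}\,p(y \mid \theta)\,dy = \frac{d}{d\theta}\int p(y \mid \theta)\,dy = 0,
\end{aligned} \tag{2.16}$$

which proves (2.14).
From (2.16),

$$\begin{aligned}
0 &= \frac{d}{d\theta}\int \left(\frac{d}{d\theta}\log p(y \mid \theta)\right) p(y \mid \theta)\,dy \\
&= \int \frac{d}{d\theta}\left[\left(\frac{d}{d\theta}\log p(y \mid \theta)\right) p(y \mid \theta)\right] dy \\
&= \int \left(\frac{d^2}{d\theta^2}\log p(y \mid \theta)\right) p(y \mid \theta)\,dy + \int \left(\frac{d}{d\theta}\log p(y \mid \theta)\right)\left(\frac{d}{d\theta}p(y \mid \theta)\right) dy \\
&= \int \left(\frac{d^2}{d\theta^2}\log p(y \mid \theta)\right) p(y \mid \theta)\,dy + \int \left(\frac{d}{d\theta}\log p(y \mid \theta)\right)^2 p(y \mid \theta)\,dy \\
&= \mathrm{E}\left[\frac{d^2}{d\theta^2}\log p(Y \mid \theta)\right] + \mathrm{E}\left[\left(\frac{d}{d\theta}\log p(Y \mid \theta)\right)^2\right],
\end{aligned}$$

which proves (2.15).

□

Viewing the score as an estimating function,

$$G_n(\theta) = \frac{1}{n}S(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log p(Y_i \mid \theta),$$

shows that the MLE satisfies $G_n(\hat\theta_n) = 0$. We have already seen that

$$\mathrm{E}[G_n(\theta)] = \frac{1}{n}\mathrm{E}[S(\theta)] = 0,$$

and to apply Result 2.1 of Sect. 2.3, we require

$$A(\theta) = \mathrm{E}\left[\frac{\partial}{\partial\theta^T}G(\theta, Y)\right] = \mathrm{E}\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log p(Y \mid \theta)\right]$$

and

$$B(\theta) = \mathrm{E}\left[G(\theta, Y)G(\theta, Y)^T\right] = \mathrm{E}\left[\left(\frac{\partial}{\partial\theta}\log p(Y \mid \theta)\right)\left(\frac{\partial}{\partial\theta}\log p(Y \mid \theta)\right)^T\right].$$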
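
Identities (2.14) and (2.15) can be checked exactly for a discrete model by summing over the sample space. The sketch below does this for a single binomial observation; the function name and the values of $n$ and $p$ are illustrative assumptions, and the score and second derivative are those of the binomial log-likelihood in $p$.

```python
from math import comb


def binomial_score_moments(n, p):
    """Return (E[S], E[S^2], E[d^2 l / dp^2]) for Y ~ Binomial(n, p), by exact summation."""
    e_score = e_score_sq = e_hessian = 0.0
    for y in range(n + 1):
        prob = comb(n, y) * p ** y * (1 - p) ** (n - y)
        score = y / p - (n - y) / (1 - p)                  # dl/dp
        hessian = -y / p ** 2 - (n - y) / (1 - p) ** 2     # d^2 l / dp^2
        e_score += prob * score
        e_score_sq += prob * score ** 2
        e_hessian += prob * hessian
    return e_score, e_score_sq, e_hessian


e_s, e_s2, e_h = binomial_score_moments(n=10, p=0.3)
print(e_s, e_s2, -e_h, 10 / (0.3 * 0.7))
```

Up to rounding error, the first value is zero and the remaining three agree, which is the univariate statement that the expected information equals both the variance of the score and minus the expected second derivative.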
Equation (2.15) shows that

$$I_1(\theta) = -A(\theta) = B(\theta)$$

and, from (2.12),

$$n^{-1/2}S(\theta) \to_d N[0, I_1(\theta)]. \tag{2.17}$$

From Result 2.1, the asymptotic distribution of the MLE is therefore

$$\sqrt{n}\,(\hat\theta_n - \theta) \to_d N_p\left[0, I_1(\theta)^{-1}\right]. \tag{2.18}$$

For independent, but not necessarily identically distributed, random variables
$Y_1, \ldots, Y_n$,

$$I_n(\theta) = -A_n(\theta) = B_n(\theta),$$

and

$$I_n(\theta)^{1/2}(\hat\theta_n - \theta) \to_d N_p(0, I_p). \tag{2.19}$$

The  information  is  scaling  the  statistic  and  should  be  growing  with  n  for  the 
asymptotic  distribution  to  be  appropriate.  Intuitively,  the  curvature  of  the  log- 
likelihood,  as  measured  by  the  second  derivative,  determines  the  variability  of  the 
estimator;  the  greater  the  curvature,  the  smaller  the  variance  of  the  estimator.  The 
distribution of $\hat\theta_n$ is sometimes written as

$$\hat\theta_n \to_d N_p\left[\theta, I_n(\theta)^{-1}\right],$$

but this is a little sloppy since the limiting distribution should be independent of $n$.
The variance of the score-based estimating function has the property that $A = A^T$
because the matrix of second derivatives is symmetric, that is,

$$\frac{\partial^2 l}{\partial\theta_j\,\partial\theta_k} = \frac{\partial^2 l}{\partial\theta_k\,\partial\theta_j}$$

for $j, k = 1, \ldots, p$.

If there is a unique maximum, then the MLE is consistent and asymptotically
normal. The Cramér-Rao bound was given in (2.3). In the present terminology, for
any unbiased estimator, $\tilde\theta$, the bound is $\mathrm{var}(\tilde\theta) \ge I_n(\theta)^{-1}$ so that the MLE is
asymptotically efficient. Asymptotic efficiency under correct model specification is
a primary motivation for the widespread use of MLEs.

For inference via (2.18), we may also replace the expected information by the
observed information,

$$I_n^\star = -\frac{\partial^2 l}{\partial\theta\,\partial\theta^T}.$$

Asymptotically, their use is equivalent since $I_n^\star \to_p I_n$ as $n \to \infty$ by the weak law
of large numbers (Appendix G).

The regularity conditions required to derive the asymptotic distribution of the
MLE include identifiability so that each element of the parameter space $\Theta$ should
correspond to a different model $p(y \mid \theta)$, otherwise there would be no unique
value of $\theta$ to which $\hat\theta$ would converge. We require the interchange of differentiation
and integration, and so the range of the data cannot depend on an unknown
parameter. Additionally, the true parameter value must lie in the interior of the
parameter space, and the Taylor series expansion that was used to determine the
asymptotic distribution of $\hat\theta$ requires a well-behaved derivative and so the amount
of information must increase with sample size. One situation in which one must be
wary is when the number of parameters increases with sample size, since this number
cannot increase too quickly; see Exercise 2.6 for a model in which this condition
is violated.

In  Sect.  2.4.3,  we  examine  the  effects  on  inference  based  on  the  MLE  of 
model  misspecification  and,  in  Sects.  2.6  and  2.7,  describe  methods  for  determining 
properties  of  the  estimator  that  do  not  depend  on  correct  specification  of  the  full 
probability  model. 


Example:  Binomial  Likelihood 


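For a single observation from a binomial distribution, $Y \mid p \sim \mathrm{Binomial}(n, p)$, the
log-likelihood is

$$l(p) = Y\log p + (n - Y)\log(1 - p),$$

where we omit the term $\log\binom{n}{Y}$ because it is constant with respect to $p$. The
score is

$$S(p) = \frac{dl}{dp} = \frac{Y}{p} - \frac{n - Y}{1 - p},$$

and setting $S(\hat p) = 0$ gives $\hat p = Y/n$. In addition

$$\frac{d^2l}{dp^2} = -\frac{Y}{p^2} - \frac{n - Y}{(1 - p)^2},$$

and

$$I(p) = -\mathrm{E}\left[\frac{d^2l}{dp^2}\right] = \frac{n}{p(1 - p)}.$$

We therefore see that the amount of information in the data for $p$ is greater if $p$ is
closer to 0 or 1. This is intuitively reasonable since the variance of $Y$ is $np(1 - p)$
and so there is less variability in the data (and hence less uncertainty) if $p$ is close to
0 or 1. The asymptotic distribution of the MLE is

$$\sqrt{n}\,(\hat p - p) \to_d N[0, p(1 - p)],$$

so that an asymptotic 95% confidence interval for $p$ is

$$\left[\hat p - 1.96 \times \sqrt{\frac{\hat p(1 - \hat p)}{n}},\ \hat p + 1.96 \times \sqrt{\frac{\hat p(1 - \hat p)}{n}}\right].$$

Unfortunately, the endpoints of this interval are not guaranteed to lie in (0,1).
To rectify this shortcoming, we may parameterize in terms of the logit of $p$,
$\theta = \log[p/(1 - p)]$. We could derive the asymptotic distribution using the delta
method, but instead we reparameterize the model to give

$$l(\theta) = Y\theta - n\log[1 + \exp(\theta)],$$

and, proceeding as in the previous parameterization,

$$\hat\theta = \log\left(\frac{Y}{n - Y}\right)$$

and

$$I(\theta) = \frac{n\exp(\theta)}{[1 + \exp(\theta)]^2},$$

to give

$$\sqrt{n}\,(\hat\theta - \theta) \to_d N\left(0, \frac{[1 + \exp(\theta)]^2}{\exp(\theta)}\right).$$

An asymptotic 95% confidence interval for $p$ follows from transforming the
endpoints of the interval for $\theta$:

$$\left[\frac{\exp\left(\hat\theta - 1.96 \times \sqrt{\mathrm{var}(\hat\theta)/n}\right)}{1 + \exp\left(\hat\theta - 1.96 \times \sqrt{\mathrm{var}(\hat\theta)/n}\right)},\ \frac{\exp\left(\hat\theta + 1.96 \times \sqrt{\mathrm{var}(\hat\theta)/n}\right)}{1 + \exp\left(\hat\theta + 1.96 \times \sqrt{\mathrm{var}(\hat\theta)/n}\right)}\right],$$

where $\mathrm{var}(\hat\theta) = [1 + \exp(\hat\theta)]^2/\exp(\hat\theta)$ is the estimated asymptotic variance of
$\sqrt{n}\,(\hat\theta - \theta)$. The endpoints will be contained in (0,1), though $\hat\theta$ is undefined if $Y = 0$ or $Y = n$.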
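
The two intervals can be compared directly. The sketch below is a minimal illustration with hypothetical function names and an assumed data set of one success in twenty trials; it computes the interval on the probability scale and the interval obtained by transforming the endpoints of the logit-scale interval.

```python
from math import exp, log, sqrt


def wald_interval_p(y, n, z=1.96):
    """Interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n); may fall outside (0, 1)."""
    p_hat = y / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def logit_interval_p(y, n, z=1.96):
    """Interval for p obtained by back-transforming the interval for the logit."""
    if y == 0 or y == n:
        raise ValueError("the logit MLE is undefined when y is 0 or n")
    theta_hat = log(y / (n - y))
    # Inverse information at theta_hat: [1 + exp(theta)]^2 / (n exp(theta)).
    se = sqrt((1 + exp(theta_hat)) ** 2 / (n * exp(theta_hat)))
    lower, upper = theta_hat - z * se, theta_hat + z * se
    return exp(lower) / (1 + exp(lower)), exp(upper) / (1 + exp(upper))


print(wald_interval_p(1, 20))    # lower endpoint is negative
print(logit_interval_p(1, 20))   # both endpoints lie in (0, 1)
```

With so few successes the interval on the probability scale extends below zero, whereas the transformed interval respects the parameter space by construction.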


Example:  Lung  Cancer  and  Radon 

Consider the model

$$Y_i \mid \beta \sim_{ind} \mathrm{Poisson}(\mu_i),$$

with $\mu_i = E_i\exp(\boldsymbol{x}_i\beta)$, $\boldsymbol{x}_i = [1, x_i]$, $i = 1, \ldots, n$, and $\beta = [\beta_0, \beta_1]^T$.

The probability distribution of $y$ is

$$p(y \mid \beta) = \exp\left(\sum_{i=1}^n y_i\log\mu_i - \sum_{i=1}^n \mu_i - \sum_{i=1}^n \log y_i!\right)$$

to give log-likelihood (up to terms that do not depend on $\beta$)

$$l(\beta) = \beta^T\sum_{i=1}^n \boldsymbol{x}_i^T Y_i - \sum_{i=1}^n E_i\exp(\boldsymbol{x}_i\beta)$$

and $2 \times 1$ score vector (estimating function)

$$S(\beta) = \frac{\partial l}{\partial\beta} = \sum_{i=1}^n \boldsymbol{x}_i^T\left[Y_i - E_i\exp(\boldsymbol{x}_i\beta)\right] = \boldsymbol{x}^T[Y - \mu(\beta)], \tag{2.20}$$

where $\boldsymbol{x} = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n]^T$ (the $n \times 2$ matrix with rows $\boldsymbol{x}_i$), $Y = [Y_1, \ldots, Y_n]^T$, and
$\mu = [\mu_1, \ldots, \mu_n]^T$. The equation
$S(\hat\beta) = 0$ does not, in general, have a closed-form solution, but, pathological
datasets aside, numerical solution is straightforward. Asymptotic inference is
based on

$$I_n(\beta)^{1/2}(\hat\beta_n - \beta) \to_d N_2(0, I_2),$$

where the information matrix is

$$I_n(\beta) = \mathrm{var}(S) = \sum_{i=1}^n \boldsymbol{x}_i^T\,\mathrm{var}(Y_i)\,\boldsymbol{x}_i = \boldsymbol{x}^T V\boldsymbol{x},$$

with $V$ the diagonal matrix with elements $\mathrm{var}(Y_i) = E_i\exp(\boldsymbol{x}_i\beta)$, $i = 1, \ldots, n$.
In this case, the expected and observed information coincide. In practice, the
information is estimated by replacing $\beta$ by $\hat\beta_n$. An important observation is that
if the mean is correctly specified, the score (2.20) is a consistent estimator of zero,
and $\hat\beta_n$ is a consistent estimator of $\beta$. In particular, if the data do not conform to
$\mathrm{var}(Y_i) = \mu_i$, we still have a consistent estimator, but the standard errors will be
incorrect.
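
One standard way to solve the score equation numerically is Fisher scoring, which repeatedly updates $\beta$ by the inverse information times the score. The sketch below is an illustration under stated assumptions: it is not the algorithm used for the analysis in the text, the toy data are invented, and the two-parameter case is coded by hand so that only the standard library is needed.

```python
from math import exp


def score_and_information(beta, y, x, E):
    """Score x^T (Y - mu) and information x^T V x for mu_i = E_i exp(b0 + b1 x_i)."""
    b0, b1 = beta
    mu = [Ei * exp(b0 + b1 * xi) for xi, Ei in zip(x, E)]
    s0 = sum(yi - mi for yi, mi in zip(y, mu))
    s1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    i00 = sum(mu)
    i01 = sum(xi * mi for xi, mi in zip(x, mu))
    i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    return (s0, s1), (i00, i01, i11)


def fisher_scoring(y, x, E, beta=(0.0, 0.0), max_iter=100, tol=1e-12):
    """Iterate beta <- beta + I(beta)^{-1} S(beta); return estimate and inverse information."""
    for _ in range(max_iter):
        (s0, s1), (i00, i01, i11) = score_and_information(beta, y, x, E)
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det   # first element of I^{-1} S
        d1 = (i00 * s1 - i01 * s0) / det   # second element of I^{-1} S
        beta = (beta[0] + d0, beta[1] + d1)
        if abs(d0) + abs(d1) < tol:
            break
    _, (i00, i01, i11) = score_and_information(beta, y, x, E)
    det = i00 * i11 - i01 * i01
    cov = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
    return beta, cov


# Invented toy data: counts y, covariate x, expected counts E.
y = [12, 9, 7, 4, 3]
x = [0.0, 1.0, 2.0, 3.0, 4.0]
E = [10.0] * 5
beta_hat, cov = fisher_scoring(y, x, E)
print(beta_hat, cov[0][0] ** 0.5, cov[1][1] ** 0.5)
```

The returned matrix is the inverse of the model-based information, so the standard errors it gives are appropriate only if the variance assumption holds, in line with the caveat above.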

For the lung cancer data, we have $n = 85$, and the MLE is $\hat\beta = [0.17, -0.036]^T$
with estimated variance

$$\widehat{\mathrm{var}}(\hat\beta) = \begin{bmatrix} 0.027^2 & -0.95 \times 0.027 \times 0.0054 \\ -0.95 \times 0.027 \times 0.0054 & 0.0054^2 \end{bmatrix}.$$

The estimated standard errors of $\hat\beta_0$ and $\hat\beta_1$ are 0.027 and 0.0054, respectively,
and an asymptotic 95% confidence interval for $\beta_1$ is $[-0.047, -0.026]$. Leaning
on asymptotic normality is appropriate with the large sample size here. A useful
inferential summary is an asymptotic 95% confidence interval for the area-level
relative risk associated with a one-unit increase in residential radon, which is

$$\exp(-0.036 \pm 1.96 \times 0.0054) = [0.954, 0.975].$$

This interval suggests that the decrease in lung cancer incidence associated with a
one-unit increase in residential radon is between 2.5% and 4.6%, though we stress
that this is an ecological (area-level) analysis, and we would not transfer inference
from the level of the area to the level of the individuals within the areas (as discussed
in Sect. 1.3.3).
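
The relative risk interval is obtained by exponentiating the endpoints of the interval for the log relative risk. The short sketch below reproduces the arithmetic from the rounded estimate and standard error quoted above; the variable names are illustrative.

```python
from math import exp

beta1_hat, se = -0.036, 0.0054          # rounded values quoted in the text
lower = exp(beta1_hat - 1.96 * se)
upper = exp(beta1_hat + 1.96 * se)

print(round(lower, 3), round(upper, 3))  # prints 0.954 0.975
# Percentage decrease in risk per one-unit increase in radon.
print(round(100 * (1 - upper), 1), round(100 * (1 - lower), 1))
```

Because the exponential function is monotone, transforming the endpoints preserves the asymptotic coverage of the interval.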


Example:  Weibull  Model 

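The Weibull distribution is useful for the modeling of survival and reliability data
and is of the form

$$p(y \mid \theta) = \theta_1\theta_2^{\theta_1}y^{\theta_1 - 1}\exp\left[-(\theta_2 y)^{\theta_1}\right],$$

where $y > 0$, $\theta = [\theta_1, \theta_2]^T$ and $\theta_1, \theta_2 > 0$. The mean and variance of the Weibull
distribution are

$$\mathrm{E}[Y \mid \theta] = \Gamma(1/\theta_1 + 1)/\theta_2,$$

$$\mathrm{var}(Y \mid \theta) = \left[\Gamma(2/\theta_1 + 1) - \Gamma(1/\theta_1 + 1)^2\right]/\theta_2^2,$$

where

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1}\exp(-x)\,dx$$

is the gamma function. Therefore, the first two moments are not simple functions
of $\theta_1$ and $\theta_2$. With independent and identically distributed observations $Y_i$, $i =
1, \ldots, n$, from a Weibull distribution the log-likelihood is

$$l(\theta) = n\log\theta_1 + n\theta_1\log\theta_2 + (\theta_1 - 1)\sum_{i=1}^n \log Y_i - \theta_2^{\theta_1}\sum_{i=1}^n Y_i^{\theta_1},$$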

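with score equations

$$S_1(\theta) = \frac{\partial l}{\partial\theta_1} = \frac{n}{\theta_1} + n\log\theta_2 + \sum_{i=1}^n \log Y_i - \theta_2^{\theta_1}\sum_{i=1}^n Y_i^{\theta_1}\log(\theta_2 Y_i),$$

$$S_2(\theta) = \frac{\partial l}{\partial\theta_2} = \frac{n\theta_1}{\theta_2} - \theta_1\theta_2^{\theta_1 - 1}\sum_{i=1}^n Y_i^{\theta_1},$$

which have no closed-form solution and are not a function of a sufficient statistic
of dimension less than $n$. Hence, consistency of $\hat\theta_n$, where $S(\hat\theta_n) = 0$, cannot be
determined from consideration of the first moment (or even the first two moments)
of the data only, unlike the Poisson example. In particular, consistency under model
misspecification cannot easily be determined.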
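
Although the score equations have no closed-form solution, they are easy to solve numerically. The sketch below is one possible approach, not a method described in the text: setting the derivative with respect to $\theta_2$ to zero gives $\theta_2^{\theta_1} = n/\sum_i Y_i^{\theta_1}$, and substituting this into the other score equation leaves a single equation in $\theta_1$ that is solved by bisection. The function names, the simulated data, and the assumption that the observations are not all equal are ours.

```python
import random
from math import log


def weibull_score(theta1, theta2, y):
    """Score vector (dl/dtheta1, dl/dtheta2) of the Weibull log-likelihood."""
    n = len(y)
    pw = [(theta2 * v) ** theta1 for v in y]   # (theta2 * y_i)^theta1
    s1 = (n / theta1 + n * log(theta2) + sum(log(v) for v in y)
          - sum(p * log(theta2 * v) for p, v in zip(pw, y)))
    s2 = n * theta1 / theta2 - (theta1 / theta2) * sum(pw)
    return s1, s2


def weibull_mle(y, iters=200):
    """Solve the score equations by profiling out theta2 and bisecting on theta1."""
    n = len(y)
    logs = [log(v) for v in y]
    mean_log = sum(logs) / n

    def h(k):
        # dl/dtheta1 divided by n, with theta2 replaced by its maximizer given k.
        w = [v ** k for v in y]
        return 1.0 / k + mean_log - sum(wi * li for wi, li in zip(w, logs)) / sum(w)

    lo, hi = 1e-3, 1.0
    while h(hi) > 0:            # expand the bracket until h changes sign
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    theta1 = 0.5 * (lo + hi)
    theta2 = (n / sum(v ** theta1 for v in y)) ** (1.0 / theta1)
    return theta1, theta2


rng = random.Random(0)
# random.weibullvariate(scale, shape): scale 0.5 and shape 2 give theta1 = 2, theta2 = 2.
y = [rng.weibullvariate(0.5, 2.0) for _ in range(2000)]
theta1_hat, theta2_hat = weibull_mle(y)
print(theta1_hat, theta2_hat, weibull_score(theta1_hat, theta2_hat, y))
```

Evaluating both score components at the returned values provides a direct check that the estimating equation has been solved.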
2.4.2 Variants on Likelihood

Estimation via the likelihood, as defined by $L(\theta) = p(y \mid \theta)$, is not universally
applied. In some situations, such as when regularity conditions are violated,
alternative  versions  are  required  to  provide  procedures  that  produce  estimators  with 
desirable  properties.  In  other  situations,  alternative  likelihoods  provide  estimators 
with  better  small  sample  properties,  perhaps  because  nuisance  parameters  are  dealt 
with  more  efficiently.  Unfortunately,  the  construction  of  these  likelihoods  is  not 
prescriptive  and  can  require  a  great  deal  of  ingenuity.  We  describe  conditional, 
marginal,  and  profile  likelihoods. 


Conditional  Likelihood 

Suppose $\lambda$ represents the parameters of interest, with $\phi$ being nuisance parameters.
Suppose the distribution for $y$ can be factorized as

$$p(y \mid \lambda, \phi) \propto p(t_1 \mid t_2, \lambda)\,p(t_2 \mid \lambda, \phi), \tag{2.21}$$

where $t_1$ and $t_2$ are statistics, that is, functions of $y$. Then inference for $\lambda$ may be
based on the conditional likelihood

$$L_c(\lambda) = p(t_1 \mid t_2, \lambda). \tag{2.22}$$

The conditional likelihood has similar properties to a regular likelihood. Conditional
likelihoods may be used in situations in which we wish to eliminate nuisance
parameters. The conditioning statistic, $t_2$, is not ancillary (Appendix F), so that it
does depend on $\lambda$, and so some information may be lost in the act of conditioning,
but the benefits of elimination are assumed to outweigh this loss. Conditional
likelihoods will be used in Sect. 7.7 in the context of Fisher's exact test and
individually matched case-control studies (in which the number of parameters
increases with sample size) and in Sects. 9.5 and 9.13.4 to eliminate random effects
in mixed effects models.
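
A factorization of the form (2.21) can be verified numerically in a simple case. The example below is not taken from the text and is offered only as a sketch under assumed conditions: two independent Poisson counts with means $\lambda\phi$ and $\phi$, with $t_1 = Y_1$ and $t_2 = Y_1 + Y_2$. Conditional on the total, $Y_1$ is binomial with probability $\lambda/(1 + \lambda)$, which is free of the nuisance parameter $\phi$, while the distribution of the total depends on both parameters.

```python
from math import comb, exp, factorial


def poisson_pmf(y, mu):
    """Poisson probability mass function."""
    return exp(-mu) * mu ** y / factorial(y)


def joint(y1, y2, lam, phi):
    """Joint probability of independent Y1 ~ Poisson(lam * phi), Y2 ~ Poisson(phi)."""
    return poisson_pmf(y1, lam * phi) * poisson_pmf(y2, phi)


def conditional(y1, t2, lam):
    """p(t1 | t2, lambda): Binomial(t2, lam / (1 + lam)), free of phi."""
    q = lam / (1 + lam)
    return comb(t2, y1) * q ** y1 * (1 - q) ** (t2 - y1)


def total(t2, lam, phi):
    """p(t2 | lambda, phi): Poisson with mean phi * (1 + lam)."""
    return poisson_pmf(t2, phi * (1 + lam))


y1, y2, lam, phi = 3, 5, 0.7, 4.0
print(joint(y1, y2, lam, phi), conditional(y1, y1 + y2, lam) * total(y1 + y2, lam, phi))
```

In this assumed example the conditional likelihood is maximized at $\hat\lambda = y_1/y_2$, whatever the value of $\phi$.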


Marginal  Likelihood 

Let $S_1, S_2, A$ be a minimal sufficient statistic where $A$ is ancillary (Appendix F),
and suppose we have the factorization

$$\begin{aligned}
p(y \mid \lambda, \phi) &\propto p(s_1, s_2, a \mid \lambda, \phi) \\
&= p(a)\,p(s_1 \mid a, \lambda)\,p(s_2 \mid s_1, a, \lambda, \phi),
\end{aligned}$$

where $\lambda$ are parameters of interest and $\phi$ are the remaining (nuisance) parameters. In
contrast to conditional likelihood, marginal likelihoods are based on averaging over
parts of the data to obtain $p(s_1 \mid a, \lambda)$, though operationally marginal likelihoods
are often derived without the need for explicit averaging.

Inference for $\lambda$ may be based on the marginal likelihood

$$L_m(\lambda) = p(s_1 \mid a, \lambda)$$

and is desirable if inference is simplified or if problems with standard likelihood
methods are to be avoided.

These advantages may outweigh the loss of efficiency in ignoring the term
$p(s_2 \mid s_1, a, \lambda, \phi)$. If there is no ancillary statistic, then the marginal likelihood is

$$L_m(\lambda) = p(s_1 \mid \lambda).$$

The marginal likelihood has similar properties to a regular likelihood. We will make
use of marginal likelihoods for variance component estimation in mixed effects
models in Sect. 8.5.3.


Example:  Normal  Linear  Model 


Assume $Y \mid \beta, \sigma^2 \sim N_n(x\beta, \sigma^2 I_n)$ where $x$ is the $n \times (k + 1)$ design matrix
and $\dim(\beta) = k + 1$. Suppose the parameter of interest is $\lambda = \sigma^2$, with remaining
parameters $\phi = \beta$. The MLE for $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n}(y - x\hat\beta)^T(y - x\hat\beta) = \frac{\mathrm{RSS}}{n}$$

with $\hat\beta = (x^Tx)^{-1}x^TY$. It is well known that $\hat\sigma^2$ has finite sample bias, because
the estimation of $\beta$ is not taken into account. The minimal sufficient statistics are
$s_1 = S^2 = \mathrm{RSS}/(n - k - 1)$ and $s_2 = \hat\beta$. We write the probability density for $y$ in
terms of $s_1$ and $s_2$:


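$$\begin{aligned}
p(y \mid \sigma^2, \beta) &= (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}(y - x\beta)^T(y - x\beta)\right] \\
&\propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}(n-k-1)s^2\right] \exp\left[-\frac{1}{2\sigma^2}(\hat\beta - \beta)^T x^T x(\hat\beta - \beta)\right] \\
&= p(s_1 \mid \sigma^2)\, p(s_2 \mid \beta, \sigma^2)
\end{aligned}$$

where going between the first and second line is straightforward if we recognize that

$$\begin{aligned}
(y - x\beta)^T(y - x\beta) &= (y - x\hat\beta + x\hat\beta - x\beta)^T(y - x\hat\beta + x\hat\beta - x\beta) \\
&= (y - x\hat\beta)^T(y - x\hat\beta) + (\hat\beta - \beta)^T x^T x(\hat\beta - \beta),
\end{aligned} \tag{2.23}$$

with the cross term disappearing because the vector of residuals $y - x\hat\beta$ is orthogonal
to the columns of $x$ (and, under normality, independent of $\hat\beta$). Consequently, the
marginal likelihood is

$$L_m(\sigma^2) = p(s^2 \mid \sigma^2).$$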


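Since the data are normal

$$\frac{(n-k-1)s^2}{\sigma^2} \sim \chi^2_{n-k-1} = \text{Ga}\left(\frac{n-k-1}{2}, \frac{1}{2}\right)$$

and so

$$p(s^2 \mid \sigma^2) = \left(\frac{n-k-1}{2\sigma^2}\right)^{(n-k-1)/2} \frac{(s^2)^{(n-k-1)/2 - 1}}{\Gamma[(n-k-1)/2]} \exp\left[-\frac{(n-k-1)s^2}{2\sigma^2}\right]$$

to give

$$l_m = \log L_m = -(n-k-1)\log\sigma - \frac{(n-k-1)s^2}{2\sigma^2}$$

and marginal likelihood estimator $\hat\sigma^2 = s^2$, the usual unbiased estimator.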
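
As an informal check on the bias claim, the following sketch simulates from a normal linear model and averages the two estimators of $\sigma^2$ over repeated samples. It is not part of the original analysis: the design matrix, the coefficient values, the sample size, and the function name `variance_estimates` are illustrative assumptions.

```python
import numpy as np

def variance_estimates(y, x):
    """Return (MLE, marginal likelihood estimate) of sigma^2 for y ~ N(x beta, sigma^2 I)."""
    n, p = x.shape                                    # p = k + 1 regression coefficients
    beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)  # least squares = MLE of beta
    rss = float(np.sum((y - x @ beta_hat) ** 2))
    return rss / n, rss / (n - p)                     # RSS/n and RSS/(n - k - 1)

rng = np.random.default_rng(1)
n, sigma2 = 10, 4.0                                   # assumed sample size and true variance
t = np.linspace(0.0, 1.0, n)
x = np.column_stack([np.ones(n), t, t ** 2])          # assumed design with k + 1 = 3 columns
beta = np.array([1.0, -2.0, 0.5])                     # assumed true coefficients

sims = np.array([
    variance_estimates(x @ beta + rng.normal(scale=np.sqrt(sigma2), size=n), x)
    for _ in range(20000)
])
mle_mean, marginal_mean = sims.mean(axis=0)
print(f"average MLE: {mle_mean:.2f}, average marginal estimate: {marginal_mean:.2f}")
```

In this setup the MLE should average close to $\sigma^2(n-k-1)/n = 2.8$, whereas the marginal likelihood estimator should average close to the true value $\sigma^2 = 4$.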


Profile  Likelihood 


Profile likelihood provides a method of examining the behavior of a subset of the
parameters. If $\theta = [\lambda, \phi]$, where $\lambda$ again represents a vector of parameters of
interest and $\phi$ the remaining parameters, then the profile likelihood $L_p(\lambda)$ for $\lambda$
is defined as

$$L_p(\lambda) = \max_{\phi} L(\lambda, \phi). \tag{2.24}$$

If $\hat\lambda_p$ denotes the maximum of $L_p(\lambda)$ and $\hat\theta = [\hat\lambda, \hat\phi]$ is the MLE, then $\hat\lambda_p = \hat\lambda$.


Profile  likelihoods  will  be  encountered  in  Sect.  8.5,  in  the  context  of  the  estimation 
of  variance  components  in  linear  mixed  effects  models. 
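
To make definition (2.24) concrete, the sketch below profiles out $\beta$ in the normal linear model and maximizes the resulting profile log-likelihood for $\sigma^2$ over a grid. The simulated data, the grid limits, and the function name `profile_loglik_sigma2` are assumptions made for illustration only.

```python
import numpy as np

def profile_loglik_sigma2(sigma2, y, x):
    """Profile log-likelihood of sigma^2 in y ~ N(x beta, sigma^2 I).

    For every value of sigma^2 the maximizing beta is the least squares
    estimate, so profiling replaces beta by that estimate.
    """
    n = len(y)
    beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
    rss = np.sum((y - x @ beta_hat) ** 2)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - rss / (2.0 * sigma2)

rng = np.random.default_rng(2)
n = 50
x = np.column_stack([np.ones(n), rng.uniform(size=n)])       # assumed design
y = x @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

grid = np.linspace(0.5, 6.0, 5501)                           # grid spacing 0.001
sigma2_profile = grid[np.argmax(profile_loglik_sigma2(grid, y, x))]

# Full MLE of sigma^2 for comparison: RSS / n
beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
sigma2_mle = np.sum((y - x @ beta_hat) ** 2) / n
print(sigma2_profile, sigma2_mle)
```

Up to the resolution of the grid, the maximizer of the profile likelihood coincides with the MLE $\text{RSS}/n$, as the statement following (2.24) asserts.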


2.4.3  Model  Misspecification 

In the following, we begin by assuming independent observations. We have seen
that if the assumed model is correct then the MLE, $\hat\theta_n$, has asymptotic distribution

$$I_n(\theta)^{1/2}(\hat\theta_n - \theta) \to_d N_p(0, I_p).$$

In this section we examine the effects of model misspecification. We first determine
exactly what quantity the MLE is estimating under misspecification and then
examine the asymptotic distribution of the MLE. Let $p(y \mid \theta)$ and $p_T(y)$ denote
the assumed and true densities, respectively.

The average of the log-likelihood is such that

$$\frac{1}{n}\sum_{i=1}^n \log p(Y_i \mid \theta) \to_{a.s.} E_T[\log p(Y \mid \theta)] \tag{2.25}$$

by the strong law of large numbers. Hence, asymptotically the MLE maximizes the
expectation of the assumed log-likelihood under the true model and $\hat\theta_n \to_p \theta_T$. We
now investigate what $\theta_T$ represents when we have assumed an incorrect model. We
write

$$\begin{aligned}
E_T[\log p(Y \mid \theta)] &= E_T[\log p_T(Y) - \log p_T(Y) + \log p(Y \mid \theta)] \\
&= E_T[\log p_T(Y)] - \text{KL}(p_T, p),
\end{aligned} \tag{2.26}$$

where

$$\text{KL}(f, g) = \int \log\left[\frac{f(y)}{g(y)}\right] f(y)\, dy$$

is the Kullback-Leibler measure of the “distance” between the densities $f$ and $g$
(the measure is not symmetric so is not a conventional distance measure). The first
term of (2.26) does not depend on $\theta$, and so the MLE minimizes $\text{KL}(p_T, p)$, and is
therefore that value of $\theta$ which makes the assumed model closest, in a Kullback-
Leibler sense, to the true model.

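We let $S_n(\theta)$ denote the score under the assumed model and state the following
result, along with a heuristic derivation.

Result. Suppose $\hat\theta_n$ is a solution to the estimating equation $S_n(\theta) = 0$, that
is, $S_n(\hat\theta_n) = 0$. Then

$$\sqrt{n}\,(\hat\theta_n - \theta_T) \to_d N_p\left[0, J^{-1}K(J^T)^{-1}\right] \tag{2.27}$$

where

$$J = J(\theta_T) = E_T\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log p(Y \mid \theta_T)\right]$$

and

$$K = K(\theta_T) = E_T\left[\left(\frac{\partial}{\partial\theta}\log p(Y \mid \theta_T)\right)\left(\frac{\partial}{\partial\theta}\log p(Y \mid \theta_T)\right)^T\right].$$

Outline Derivation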

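The derivation closely follows that of Result 2.1, and for simplicity we again assume
$\theta$ is one-dimensional. We first obtain the expectation and variance of

$$\frac{1}{n}S_n(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log p(Y_i \mid \theta),$$

in order to derive the asymptotic distribution of $S_n(\theta)$. Subsequently, we obtain the
distribution of $\hat\theta_n$.

Recall that $\theta_T$ is that value which minimizes the Kullback-Leibler distance,
that is,

$$\begin{aligned}
0 = \frac{\partial}{\partial\theta}\text{KL}(\theta)\Big|_{\theta_T}
&= \frac{\partial}{\partial\theta}\int \log\left[\frac{p_T(y)}{p(y \mid \theta)}\right] p_T(y)\, dy\,\Big|_{\theta_T} \\
&= \frac{\partial}{\partial\theta}\int \log p_T(y)\, p_T(y)\, dy - \int \frac{\partial}{\partial\theta}\log p(y \mid \theta)\Big|_{\theta_T} p_T(y)\, dy \\
&= 0 - \int \left(\frac{\partial}{\partial\theta}\log p(y \mid \theta)\Big|_{\theta_T}\right) p_T(y)\, dy
\end{aligned}$$

and so $E_T[S(\theta_T)] = 0$ (and we have assumed that we can interchange the order of
differentiation and integration).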

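For the second moment,

$$E_T\left[\frac{1}{n}\sum_{i=1}^n \left(\frac{\partial}{\partial\theta}\log p(Y_i \mid \theta)\Big|_{\theta_T}\right)^2\right] = E_T\left[\left(\frac{\partial}{\partial\theta}\log p(Y \mid \theta_T)\right)^2\right] = K,$$

which we assume exists. Hence, by the central limit theorem

$$\frac{1}{\sqrt{n}}S_n(\theta_T) \to_d N(0, K).$$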


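Expanding $S_n(\theta)$ in a Taylor series around $\theta_T$:

$$0 = \frac{1}{n}S_n(\hat\theta_n) = \frac{1}{n}S_n(\theta_T) + (\hat\theta_n - \theta_T)\,\frac{1}{n}\frac{dS_n(\theta)}{d\theta}\Big|_{\theta_T} + \frac{1}{2}(\hat\theta_n - \theta_T)^2\,\frac{1}{n}\frac{d^2S_n(\theta)}{d\theta^2}\Big|_{\tilde\theta}$$

where $\tilde\theta$ is between $\hat\theta_n$ and $\theta_T$ and

$$\frac{1}{n}\frac{dS_n(\theta)}{d\theta}\Big|_{\theta_T} = \frac{1}{n}\sum_{i=1}^n \frac{d^2}{d\theta^2}\log p(Y_i \mid \theta)\Big|_{\theta_T} \to_p E_T\left[\frac{d^2}{d\theta^2}\log p(Y \mid \theta_T)\right] = J.$$

Following the outline derivation of Result 2.1 gives

$$\sqrt{n}\,(\hat\theta_n - \theta_T) \to_d N\left(0, \frac{K}{J^2}\right),$$

as required.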


Example:  Exponential  Assumed  Model,  Gamma  True  Model 

Suppose that the assumed model is exponential with mean $\theta$ but that the true model
is gamma $\text{Ga}(\alpha, \beta)$. Minimizing the Kullback-Leibler distance with respect to $\theta$
corresponds to maximizing (2.25), that is

$$E_T[\log p(Y \mid \theta)] = E_T\left[-\log\theta - \frac{Y}{\theta}\right] = -\log\theta - \frac{\alpha/\beta}{\theta}$$

so that $\theta_T = \alpha/\beta$ is the quantity that is being estimated by the MLE. Hence, the
closest exponential distribution to the gamma distribution, in a Kullback-Leibler
sense, is the one that possesses the same mean.
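
The following sketch illustrates this example numerically and, as an extra step not carried out in the text, evaluates the variance in (2.27) with empirical versions of $J$ and $K$. The parameter values, sample size, and variable names are assumptions. For this pair of models one can check by hand that $K/J^2 = \text{var}(Y) = \alpha/\beta^2$, whereas the exponential model-based variance is $\theta_T^2 = \alpha^2/\beta^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, n = 4.0, 2.0, 200_000            # assumed true model Ga(alpha, beta), mean alpha/beta
y = rng.gamma(shape=alpha, scale=1.0 / beta, size=n)

theta_hat = y.mean()                          # MLE of the exponential mean

# Derivatives of log p(y | theta) = -log(theta) - y / theta, evaluated at theta_hat
score = -1.0 / theta_hat + y / theta_hat ** 2
hessian = 1.0 / theta_hat ** 2 - 2.0 * y / theta_hat ** 3

J_hat = hessian.mean()                        # empirical version of J
K_hat = np.mean(score ** 2)                   # empirical version of K

sandwich_var = K_hat / J_hat ** 2 / n         # variance of theta_hat implied by (2.27)
model_var = theta_hat ** 2 / n                # variance if the exponential model were true

print(theta_hat, n * sandwich_var, n * model_var)
```

With $\alpha = 4$ and $\beta = 2$ the estimate settles near $\alpha/\beta = 2$, while the scaled variances should be near 1 and 4, respectively, so in this setting the exponential model would overstate the uncertainty.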

2.5  Quasi-likelihood 

2.5.1  Maximum  Quasi-likelihood  Estimation 

In  this  section  we  describe  an  estimating  function  that  is  based  upon  the  mean  and 
variance  of  the  data  only.  Specifically,  we  assume  that  the  first  two  moments  are  of 
the  form 


$$E[Y \mid \beta] = \mu(\beta)$$

$$\text{var}(Y \mid \beta) = \alpha V[\mu(\beta)]$$

where $\mu(\beta) = [\mu_1(\beta), \ldots, \mu_n(\beta)]^T$ represents the regression function, $V$ is a
diagonal matrix (so the observations are assumed uncorrelated), with

$$\text{var}(Y_i \mid \beta) = \alpha V[\mu_i(\beta)],$$

and $\alpha > 0$ is a scalar that does not depend upon $\beta$. We assume $\beta = [\beta_0, \ldots, \beta_k]^T$
so that the dimension of $\beta$ is $k + 1$. The aim is to obtain the asymptotic properties
of an estimator of $\beta$ based on these first two moments only. The specification of the
mean function in a parametric regression setting is unavoidable, and efficiency will
clearly depend on the form of the variance model.

To motivate an estimating function, consider the sum of squares

$$(Y - \mu)^T V^{-1}(Y - \mu)/\alpha, \tag{2.28}$$

where $\mu = \mu(\beta)$ and $V = V(\beta)$. To minimize this sum of squares, there are two
ways to proceed. Perhaps the more obvious route is to acknowledge that both $\mu$ and
$V$ are functions of $\beta$ and differentiate with respect to $\beta$ to give

$$-2D^T V^{-1}(Y - \mu)/\alpha + (Y - \mu)^T \frac{\partial V^{-1}}{\partial\beta}(Y - \mu)/\alpha, \tag{2.29}$$

where $D$ is the $n \times p$ matrix of derivatives with elements $\partial\mu_i/\partial\beta_j$, $i = 1, \ldots, n$,
$j = 1, \ldots, p$. Unfortunately, (2.29) is not ideal as an estimating function because it does
not necessarily have expectation zero when we only assume $E[Y \mid \beta] = \mu$, because
of the presence of the second term. If the expectation of the estimating function is
not zero, then an inconsistent estimator of $\beta$ results.

Alternatively, we may temporarily forget that $V$ is a function of $\beta$ when we
differentiate (2.28) and solve the estimating equation

$$D(\hat\beta)^T V(\hat\beta)^{-1}\left[Y - \mu(\hat\beta)\right]/\alpha = 0.$$

As shorthand we write this estimating function as

$$U(\beta) = D^T V^{-1}(Y - \mu)/\alpha. \tag{2.30}$$


This estimating function is linear in the data and so its properties are straightforward
to evaluate. In particular,

1. $E[U(\beta)] = 0$, assuming $E[Y \mid \beta] = \mu(\beta)$.

2. $\text{var}[U(\beta)] = D^T V^{-1}D/\alpha$, assuming $\text{var}(Y \mid \beta) = \alpha V$.

3. $-E\left[\dfrac{\partial U}{\partial\beta}\right] = D^T V^{-1}D/\alpha = \text{var}[U(\beta)]$, assuming $E[Y \mid \beta] = \mu(\beta)$.

The  similarity  of  these  properties  with  those  of  the  score  function  (Sect.  2.4.1)  is 
apparent and has led to (2.30) being referred to as a quasi-score function. Let $\hat\beta_n$
represent the root of (2.30), that is, $U(\hat\beta_n) = 0$. We can apply Result 2.1 directly
to obtain the asymptotic distribution of the maximum quasi-likelihood estimator
(MQLE) as

$$(D^T V^{-1}D)^{1/2}(\hat\beta_n - \beta) \to_d N_{k+1}(0, \alpha I_{k+1}),$$

where we have assumed that $\alpha$ is known. Using (B.4) in Appendix B

$$E[(Y - \mu)^T V^{-1}(\mu)(Y - \mu)]/\alpha = n,$$

and so if $\mu$ were known, an unbiased estimator of $\alpha$ would be

$$\hat\alpha_n = (Y - \mu)^T V^{-1}(\mu)(Y - \mu)/n.$$

A degree of freedom corrected (but not in general unbiased) estimate is given by the
Pearson statistic divided by its degrees of freedom:


$$\hat\alpha_n = \frac{1}{n - k - 1}\sum_{i=1}^n \frac{(Y_i - \hat\mu_i)^2}{V(\hat\mu_i)}, \tag{2.31}$$

where $\hat\mu_i = \mu_i(\hat\beta)$. This estimator of the scale parameter is consistent so long as
the assumed variance model is correct. The asymptotic distribution that is used in
practice is therefore

$$(\hat D^T \hat V^{-1}\hat D/\hat\alpha_n)^{1/2}(\hat\beta_n - \beta) \to_d N_{k+1}(0, I_{k+1}).$$

The inclusion of an estimate for $\alpha$ is justified by applying Slutsky’s theorem
(Appendix G) to $\hat\alpha_n \times U(\hat\beta_n)$. As usual in such asymptotic calculations, the
uncertainty in $\hat\alpha_n$ is not reflected in the variance for $\hat\beta_n$. This development reveals
a mixing of inferential approaches with $\hat\beta_n$ a MQLE and $\hat\alpha_n$ a method of moments
estimator. A justification for the latter estimator is that it is likely to be consistent
in a wider range of circumstances than a likelihood-based estimator. A crucial
observation is that if the mean function is correctly specified, the estimator $\hat\beta_n$
is consistent also. Asymptotically appropriate standard errors result if the mean-
variance relationship is correctly specified. McCullagh (1983) and Godambe and
Heyde (1987) discuss the close links between consistency, the quasi-score function
(2.30), and membership of the exponential family; see also Chap. 6.
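
The steps above can be sketched in code for one specific choice that the text has not fixed at this point: a log-linear mean $\mu_i = \exp(x_i\beta)$ with variance function $V(\mu) = \mu$. The function name `fit_quasi_poisson`, the Fisher scoring iteration used to solve $U(\beta) = 0$, and the simulated overdispersed data are all assumptions made for illustration.

```python
import numpy as np

def fit_quasi_poisson(x, y, max_iter=50, tol=1e-10):
    """Solve the quasi-score equation (2.30) for mu = exp(x beta), V(mu) = mu.

    Returns the estimate of beta, the Pearson-based scale estimate (2.31), and
    the estimated covariance alpha_hat * (D^T V^{-1} D)^{-1}.
    """
    n, p = x.shape
    beta = np.zeros(p)
    beta[0] = np.log(y.mean())                 # assumes the first column of x is an intercept
    for _ in range(max_iter):
        mu = np.exp(x @ beta)
        D = x * mu[:, None]                    # D[i, j] = d mu_i / d beta_j
        info = D.T @ (D / mu[:, None])         # D^T V^{-1} D
        step = np.linalg.solve(info, D.T @ ((y - mu) / mu))
        beta = beta + step                     # alpha cancels, so it is not needed here
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(x @ beta)
    D = x * mu[:, None]
    info = D.T @ (D / mu[:, None])
    alpha_hat = np.sum((y - mu) ** 2 / mu) / (n - p)   # Pearson statistic / (n - k - 1)
    return beta, alpha_hat, alpha_hat * np.linalg.inv(info)

# Simulated counts with var(Y) = 3 * E[Y] (gamma-mixed Poisson; an assumed setup)
rng = np.random.default_rng(4)
n = 5000
x = np.column_stack([np.ones(n), rng.uniform(size=n)])
mu_true = np.exp(x @ np.array([1.0, 0.5]))
b = 0.5                                        # var(Y) = mu * (1 + 1/b) = 3 * mu
y = rng.poisson(rng.gamma(shape=mu_true * b, scale=1.0 / b)).astype(float)

beta_hat, alpha_hat, cov = fit_quasi_poisson(x, y)
print(beta_hat, alpha_hat, np.sqrt(np.diag(cov)))
```

Because $\alpha$ cancels from the estimating equation, the point estimate is the same as under a Poisson likelihood; only the standard errors are inflated, by the factor $\sqrt{\hat\alpha_n}$.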

As  an  aside,  in  the  above,  the  mean  model  does  not  need  to  be  “correct”  since 
we  are  simply  estimating  a  specified  form  of  association,  and  estimation  will  be 
performed  regardless  of  whether  this  model  is  appropriate.  Of  course,  the  usefulness 
of  inference  does  depend  on  an  appropriate  mean  model. 

As a function of $\mu$, we have the quasi-score

$$\frac{Y - \mu}{\alpha V(\mu)} \tag{2.32}$$

and integration of this quantity gives

$$l(\mu, \alpha) = \int_y^{\mu} \frac{y - t}{\alpha V(t)}\, dt,$$

which, if it exists, behaves like a log-likelihood. As an example, for the model
$E[Y] = \mu$ and $\text{var}(Y) = \alpha\mu$

$$l(\mu, \alpha) = \int_y^{\mu} \frac{y - t}{\alpha t}\, dt = \frac{1}{\alpha}\left[y\log\mu - \mu + c\right],$$

where $c = -y\log y + y$ and $y\log\mu - \mu$ is the log-likelihood of a Poisson random
variable. Table 2.1 lists some distributions that correspond to particular choices of
variance function.


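Table 2.1 Variance functions and quasi log-likelihoods

| Variance $V(\mu)$ | Quasi log-likelihood | Distribution |
|---|---|---|
| $1$ | $-\frac{1}{2\alpha}(y - \mu)^2$ | $N(\mu, \alpha)$ |
| $\mu$ | $\frac{1}{\alpha}(y\log\mu - \mu)$ | Poisson($\mu$) |
| $\mu^2$ | $\frac{1}{\alpha}\left(-\frac{y}{\mu} - \log\mu\right)$ | $\text{Ga}(1/\alpha, 1/[\mu\alpha])$ |
| $n\mu(1 - \mu)$ | $\frac{1}{\alpha}\left[y\log\left(\frac{\mu}{1-\mu}\right) + n\log(1 - \mu)\right]$ | Binomial($n, \mu$) |
| $\mu + \mu^2/b$ | $\frac{1}{\alpha}\left[y\log\left(\frac{\mu}{b+\mu}\right) + b\log\left(\frac{b}{b+\mu}\right)\right]$ | NegBin($\mu, b$), $b$ known |
| $\mu^2(1 - \mu)^2$ | $\frac{1}{\alpha}\left[(2y - 1)\log\left(\frac{\mu}{1-\mu}\right) - \frac{y}{\mu} - \frac{1-y}{1-\mu}\right]$ | No distribution |

In all cases $E[Y] = \mu$. The parameterizations of the distributional forms are as in
Appendix D. For the Poisson, binomial, and negative binomial distributions, these are
the forms that the quasi-score corresponds to when $\alpha = 1$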


The word “quasi” refers to the fact that the score may or may not correspond to a
probability function. For example, in Table 2.1, the variance function $\mu^2(1 - \mu)^2$
does not correspond to a probability distribution. In most cases, there is an implied
distributional kernel, but the addition of the variance multiplier $\alpha$ often produces a
mean-variance relationship that is not present in the implied distribution.

We  emphasize  that  the  first  two  moments  do  not  uniquely  define  a  distribution. 
For  example,  the  negative  binomial  distribution  may  be  derived  as  the  marginal 
distribution  of 


$$Y \mid \mu, \theta \sim \text{Poisson}(\mu\theta) \tag{2.33}$$

$$\theta \sim \text{Ga}(b, b) \tag{2.34}$$

so that $E[Y] = \mu$ and

$$\text{var}(Y) = E[\text{var}(Y \mid \theta)] + \text{var}(E[Y \mid \theta]) = \mu + \frac{\mu^2}{b}. \tag{2.35}$$

These latter two moments are also recovered if we replace the gamma distribution
with a lognormal distribution. Specifically, assume the model

$$Y \mid \theta^\star \sim \text{Poisson}(\theta^\star)$$

$$\theta^\star \sim \text{LogNorm}(\eta, \sigma^2)$$

and let $\mu = E[\theta^\star] = \exp(\eta + \sigma^2/2)$. Then,

$$\text{var}(\theta^\star) = E[\theta^\star]^2\left[\exp(\sigma^2) - 1\right] = \mu^2\left[\exp(\sigma^2) - 1\right].$$

Under this model, $E[Y] = \mu$ and

$$\text{var}(Y) = E[\text{var}(Y \mid \theta^\star)] + \text{var}[E(Y \mid \theta^\star)] = \mu + \mu^2\left[\exp(\sigma^2) - 1\right]$$

which, on writing $b^\star = [\exp(\sigma^2) - 1]^{-1}$, gives the same form of quadratic variance
function, (2.35), as with the gamma model.
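
A quick simulation makes the point that two different mixing distributions share the same first two moments. This is a sketch only: the values of $\mu$ and $b$, the number of draws, and the variable names are assumptions, and the lognormal parameters are chosen so that $b^\star = b$.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, b, n = 5.0, 2.0, 1_000_000                # assumed mean, dispersion, and number of draws

# Gamma mixing as in (2.33)-(2.34): theta ~ Ga(b, b) has mean 1 and variance 1/b
theta = rng.gamma(shape=b, scale=1.0 / b, size=n)
y_gamma = rng.poisson(mu * theta)

# Lognormal mixing with exp(sigma^2) - 1 = 1/b and E[theta*] = mu
sigma2 = np.log(1.0 + 1.0 / b)
eta = np.log(mu) - sigma2 / 2.0
theta_star = rng.lognormal(mean=eta, sigma=np.sqrt(sigma2), size=n)
y_lognormal = rng.poisson(theta_star)

target_var = mu + mu ** 2 / b                 # quadratic variance function (2.35)
print(y_gamma.mean(), y_gamma.var(), y_lognormal.mean(), y_lognormal.var(), target_var)
```

Both samples should have mean near $\mu = 5$ and variance near $\mu + \mu^2/b = 17.5$, even though the two marginal distributions differ (for example, in their tails).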

If the estimating function (2.30) corresponds to the score function for a particular
probability distribution, then the subsequent estimator corresponds to the MLE
(because $\alpha$ does not influence the estimation of $\beta$), though the variance of the
estimator will usually differ. A great advantage of the use of quasi-likelihood is
its computational simplicity.

A  prediction  interval  for  an  observable,  Y,  is  not  possible  with  quasi-likelihood 
since  there  is  no  probabilistic  mechanism  with  which  to  reflect  the  stochastic 
component  of  the  prediction. 


Example:  Lung  Cancer  and  Radon 

We return to the lung cancer example and now assume the quasi-likelihood model

$$E[Y_i \mid \beta] = E_i\exp(x_i\beta), \qquad \text{var}(Y_i \mid \beta) = \alpha E[Y_i \mid \beta].$$

Fitting this model yields identical point estimates to the MLEs and $\hat\alpha = 2.81$ so that
the quasi-likelihood standard errors are $\sqrt{\hat\alpha} = 1.68$ times larger than the Poisson
model-based standard errors. The variance-covariance matrix is

$$(\hat D^T \hat V^{-1}\hat D)^{-1}\hat\alpha = \begin{bmatrix} 0.045^2 & -0.95 \times 0.045 \times 0.0090 \\ -0.95 \times 0.045 \times 0.0090 & 0.0090^2 \end{bmatrix}.$$

An asymptotic 95% confidence interval for the relative risk associated with a one-
unit increase in radon is [0.947, 0.982] which is $\sqrt{\hat\alpha} = 1.68$ times wider than the Poisson
interval evaluated previously.
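
The interval can be reproduced approximately from the rounded quantities reported in the text. In the sketch below, the point estimate $-0.036$ is the MLE (and MQLE) of the radon coefficient quoted later in this chapter and $0.0090$ is the standard error read off the matrix above; because both are rounded, the endpoints agree with the reported interval only to about the third decimal place. The variable names are illustrative.

```python
import math

beta1_hat = -0.036        # MLE and MQLE of the radon coefficient, as quoted in the text
se_quasi = 0.0090         # quasi-likelihood standard error (rounded)
alpha_hat = 2.81          # estimated scale parameter
z = 1.96                  # normal quantile for a 95% interval

# Interval for the relative risk exp(beta_1) under quasi-likelihood
ci_quasi = (math.exp(beta1_hat - z * se_quasi), math.exp(beta1_hat + z * se_quasi))

# Implied Poisson model-based interval: standard error divided by sqrt(alpha_hat)
se_poisson = se_quasi / math.sqrt(alpha_hat)
ci_poisson = (math.exp(beta1_hat - z * se_poisson), math.exp(beta1_hat + z * se_poisson))

# Ratio of interval widths on the log relative risk scale
ratio = math.log(ci_quasi[1] / ci_quasi[0]) / math.log(ci_poisson[1] / ci_poisson[0])
print(ci_quasi, ci_poisson, ratio)
```

The factor of $\sqrt{\hat\alpha}$ applies exactly to the width of the interval for $\beta_1$, that is, on the log relative risk scale.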


2.5.2  A  More  Complex  Mean-Variance  Model 

For  comparison,  we  now  describe  a  more  general  model  than  considered  under  the 
quasi-likelihood  approach.  Suppose  we  specify  the  first  two  moments  of  the  data  as 

$$E[Y_i \mid \beta] = \mu_i(\beta) \tag{2.36}$$

$$\text{var}(Y_i \mid \beta) = V_i(\alpha, \beta), \tag{2.37}$$

where $\alpha$ is an $r \times 1$ vector of parameters that appear only in the variance model. Let
$\hat\alpha_n$ be a consistent estimator of $\alpha$. We state without proof the following result. The
estimator $\hat\beta_n$ that satisfies the estimating equation

$$G(\hat\beta_n, \hat\alpha_n) = D(\hat\beta_n)^T V^{-1}(\hat\alpha_n, \hat\beta_n)\left[Y - \mu(\hat\beta_n)\right] = 0 \tag{2.38}$$

has asymptotic distribution

$$(\hat D^T \hat V^{-1}\hat D)^{1/2}(\hat\beta_n - \beta) \to_d N_{k+1}(0, I_{k+1})$$

where $\hat D = D(\hat\beta_n)$ and $\hat V = V(\hat\alpha_n, \hat\beta_n)$.

The difference between this model and that in the quasi-likelihood approach is
that $V$ may now depend on additional variance-covariance parameters $\alpha$ in a more
complex way. Under quasi-likelihood it is assumed that $\text{var}(Y_i) = \alpha V_i(\mu_i)$, so that
the estimating function does not depend on $\alpha$. Consequently, $\hat\beta$ also does not depend
on $\alpha$, though the standard errors are proportional to $\sqrt{\alpha}$. This is a motivating factor
in the development of quasi-likelihood, since standard software may be used for
implementation and, perhaps more importantly, consistency of $\hat\beta$ is guaranteed if
the mean model is correctly specified.

The form of the mean-variance relationship given by (2.36) and (2.37) suggests
an iterative scheme for estimation of $\beta$ and $\alpha$. Set $t = 0$ and let $\hat\alpha^{(0)}$ be an initial
estimate for $\alpha$. Now iterate between

1. Solve $G(\beta, \hat\alpha^{(t)}) = 0$ to give $\hat\beta^{(t+1)}$.

2. Estimate $\hat\alpha^{(t+1)}$ with $\hat\mu_i = \mu_i(\hat\beta^{(t+1)})$. Set $t \to t + 1$ and return to 1.

The model given by (2.36) and (2.37) is more flexible than that provided by quasi-
likelihood but requires the correct specification of mean and variance for a consistent
estimator of $\beta$.


Example:  Lung  Cancer  and  Radon 


As  an  example  of  the  mean-variance  model  discussed  in  the  previous  section,  we 
fit  a  negative  binomial  model  to  the  lung  cancer  data.  This  model  is  motivated  via 
the  random  effects  formulation  given  by  (2.33)  and  (2.34)  with  loglinear  model 
$\mu_i = \mu_i(\beta) = E_i\exp(\beta_0 + \beta_1 x_i)$, $i = 1, \ldots, n$. In the lung cancer context, the
random effects are area-specific perturbations from the mean $\mu_i$. The introduction of
the random effects may be seen as a device for inducing overdispersion. Integrating
over $\theta_i$, we obtain the negative binomial distribution

$$\Pr(y_i \mid \beta, b) = \frac{\Gamma(y_i + b)}{\Gamma(b)\, y_i!}\,\frac{\mu_i^{y_i}\, b^b}{(\mu_i + b)^{y_i + b}},$$

for $y_i = 0, 1, 2, \ldots$, with

$$E[Y_i \mid \beta] = \mu_i(\beta), \qquad \text{var}(Y_i \mid \beta, b) = \mu_i(\beta)\left[1 + \frac{\mu_i(\beta)}{b}\right] \tag{2.39}$$

so that smaller values of $b$ correspond to greater degrees of overdispersion and
as $b \to \infty$ we recover the Poisson model. For consistency with later chapters
we use $b$ rather than $\alpha$ for the parameter occurring in the variance model. Care
is required with the negative binomial distribution since a number of different
parameterizations are available; see Exercise 2.4. The log-likelihood is


$$l(\beta, b) = \sum_{i=1}^n \left\{\log\left[\frac{\Gamma(y_i + b)}{\Gamma(b)\, y_i!}\right] + y_i\log\mu_i + b\log b - (y_i + b)\log(\mu_i + b)\right\} \tag{2.40}$$

giving the score function for $\beta$ as

$$S(\beta) = \frac{\partial l}{\partial\beta} = \sum_{i=1}^n \left(\frac{\partial\mu_i}{\partial\beta}\right)^T \frac{y_i - \mu_i}{\mu_i + \mu_i^2/b},$$

which corresponds to (2.38). Hence, for fixed $b$, we can solve this estimating
equation to obtain an estimator $\hat\beta$. Usually we will also wish to estimate $b$ (as
opposed to assuming a fixed value). One possibility is maximum likelihood though
a quick glance at (2.40) reveals that no closed-form estimator will be available
and numerical maximization will be required (which is not a great impediment).
We describe an alternative method of moments estimator which may be more robust.

For  the  quadratic  variance  model  (2.39),  the  variance  is 

$$\text{var}(Y_i \mid \beta, b) = E[(Y_i - \mu_i)^2] = \mu_i(1 + \mu_i/b),$$

so that

$$b^{-1} = E\left[\frac{(Y_i - \mu_i)^2 - \mu_i}{\mu_i^2}\right]$$

for $i = 1, \ldots, n$, leading to the method of moments estimator

$$\hat b = \left[\frac{1}{n - k - 1}\sum_{i=1}^n \frac{(Y_i - \hat\mu_i)^2 - \hat\mu_i}{\hat\mu_i^2}\right]^{-1} \tag{2.41}$$

with $k = 1$ in the lung cancer example. If we have a consistent estimator $\hat b$ (which
follows if the quadratic variance model is correct) and the mean correctly specified,
then valid inference follows from

$$(\hat D^T \hat V(\hat b)^{-1}\hat D)^{1/2}(\hat\beta - \beta) \to_d N_2(0, I_2).$$
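
The iterative scheme of Sect. 2.5.2, combined with the moment estimator (2.41), can be sketched as follows. This is an illustration under assumptions, not the computation used for the lung cancer data: the function name `fit_quadratic_variance`, the Fisher scoring inner loop, the starting values, the guard against a non-positive moment estimate, and the simulated data are all choices made here.

```python
import numpy as np

def fit_quadratic_variance(x, y, E, n_outer=25):
    """Alternate between solving (2.38) for beta with b fixed and updating b by (2.41).

    Mean model: mu_i = E_i exp(x_i beta); variance model: mu_i (1 + mu_i / b).
    """
    n, p = x.shape
    beta = np.zeros(p)
    beta[0] = np.log(y.sum() / E.sum())        # assumes the first column of x is an intercept
    b = 1e6                                    # start close to the Poisson model
    for _ in range(n_outer):
        for _ in range(50):                    # step 1: Fisher scoring for beta given b
            mu = E * np.exp(x @ beta)
            V = mu + mu ** 2 / b
            D = x * mu[:, None]
            info = D.T @ (D / V[:, None])
            step = np.linalg.solve(info, D.T @ ((y - mu) / V))
            beta = beta + step
            if np.max(np.abs(step)) < 1e-10:
                break
        mu = E * np.exp(x @ beta)              # step 2: moment estimator (2.41)
        inv_b = np.sum(((y - mu) ** 2 - mu) / mu ** 2) / (n - p)
        b = 1.0 / max(inv_b, 1e-8)             # guard: the moment estimate of 1/b can be <= 0
    mu = E * np.exp(x @ beta)
    V = mu + mu ** 2 / b
    D = x * mu[:, None]
    return beta, b, np.linalg.inv(D.T @ (D / V[:, None]))

# Simulated data from the gamma-Poisson formulation (2.33)-(2.34); values are assumptions
rng = np.random.default_rng(6)
n = 4000
E = rng.uniform(5.0, 50.0, size=n)             # expected counts
x = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
beta_true, b_true = np.array([0.1, -0.03]), 20.0
theta = rng.gamma(shape=b_true, scale=1.0 / b_true, size=n)
y = rng.poisson(E * np.exp(x @ beta_true) * theta).astype(float)

beta_hat, b_hat, cov = fit_quadratic_variance(x, y, E)
print(beta_hat, b_hat, np.sqrt(np.diag(cov)))
```

Unlike quasi-likelihood, the weights here depend on $\hat b$, so the point estimate of $\beta$ changes as the variance parameter is updated.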


We fit this model to the lung cancer data. The estimates (standard errors) are $\hat\beta_0 =
0.090$ (0.047) and $\hat\beta_1 = -0.030$ (0.0085). The latter point estimate differs a little
from the MLE (and MQLE) of $-0.036$, reflecting the different variance weighting
in the estimating function. The moment-based estimator was $\hat b = 57.8$ (the MLE
is 61.3 and so close to this value). An asymptotic 95% confidence interval for the
relative risk $\exp(\beta_1)$ is [0.955, 0.987], so that the upper limit is closer to unity than
the intervals we have seen previously.

Fig. 2.2 Linear and quadratic variance functions for the lung cancer data

In  terms  of  the  first  two  moments,  the  difference  between  quasi-likelihood 
and  the  negative  binomial  model  is  that  the  variances  are,  respectively,  linear  and 
quadratic  functions  of  the  mean.  In  Fig.  2.2,  we  plot  the  estimated  linear  and 
quadratic  variance  functions  over  the  range  of  the  mean  for  these  data.  To  produce  a 
clearer  plot,  the  log  of  the  variance  is  plotted  against  the  log  of  the  mean,  and  the  log 
of the observed counts, $y_i$, $i = 1, \ldots, 85$, is added to the plot (with a small amount
of  jitter).  Over  the  majority  of  the  data,  the  two  variance  functions  are  similar,  but 
for  large  values  of  the  mean  in  particular,  the  variance  functions  are  considerably 
different  which  leads  to  the  differences  in  inference,  since  large  observations  are 
being  weighted  very  differently  by  the  two  variance  functions.  Based  on  this  plot, 
we  might  expect  even  greater  differences.  However,  closer  examination  of  the  data 
reveals  that  the  x’s  associated  with  the  large  y  values  are  all  in  the  midrange,  and 
consequently,  these  points  are  not  influential. 

Examination  of  the  residuals  gives  some  indication  that  the  quadratic  mean- 
variance  model  is  more  appropriate  for  these  data  (see  Sect.  6.9).  It  is  typically  very 
difficult  to  distinguish  between  the  two  models,  unless  there  are  sufficient  points 
across  a  large  spread  of  mean  values. 


2.6  Sandwich  Estimation 

A  general  method  of  avoiding  stringent  modeling  conditions  when  the  variance  of 
an  estimator  is  calculated  is  provided  by  sandwich  estimation.  Recall  from  Sect.  2.3 
the estimating function

$$G_n(\theta) = \frac{1}{n}\sum_{i=1}^n G(\theta, Y_i).$$

Based on independent and identically distributed observations, we have the sandwich
form for the variance

$$\text{var}(\hat\theta_n) = \frac{A^{-1}B(A^T)^{-1}}{n} \tag{2.42}$$

where

$$A = E\left[\frac{\partial}{\partial\theta^T}G(\theta, Y)\right]$$

and

$$B = E\left[G(\theta, Y)G(\theta, Y)^T\right].$$

For  (2.42)  to  be  asymptotically  appropriate,  the  expectations  need  to  be  evaluated 
under  the  true  model  (as  discussed  in  Sect.  2.4.3). 

So  far  we  have  used  an  assumed  model  to  calculate  the  expectations.  An 
alternative  is  to  evaluate  A  and  B  empirically  via 


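$$\hat A_n = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta^T}G(\hat\theta, Y_i)$$

and

$$\hat B_n = \frac{1}{n}\sum_{i=1}^n G(\hat\theta, Y_i)G(\hat\theta, Y_i)^T.$$

By the weak law of large numbers, $\hat A_n \to_p A$ and $\hat B_n \to_p B$, and

$$\widehat{\text{var}}(\hat\theta_n) = \frac{\hat A^{-1}\hat B(\hat A^T)^{-1}}{n} \tag{2.43}$$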


is  a  consistent  estimator  of  the  variance.  The  great  advantage  of  sandwich  estimation 
is  that  it  provides  a  consistent  estimator  of  the  variance  in  very  broad  situations.  An 
important  assumption  is  that  the  observations  are  uncorrelated  (this  will  be  relaxed 
in  Part  III  of  the  book  when  generalized  estimating  equations  are  described). 
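
As a minimal sketch of (2.43), consider the least squares estimating function $G(\theta, Y_i) = x_i(Y_i - x_i^T\theta)$, a choice made here purely for illustration, applied to simulated data whose error variance is not constant. The function name `sandwich_linear` and the data-generating values are assumptions.

```python
import numpy as np

def sandwich_linear(x, y):
    """Empirical sandwich variance (2.43) for G(theta, Y_i) = x_i (Y_i - x_i^T theta)."""
    n = len(y)
    theta_hat = np.linalg.solve(x.T @ x, x.T @ y)     # root of the estimating equation
    r = y - x @ theta_hat                             # residuals
    A_n = -(x.T @ x) / n                              # average derivative of G
    B_n = (x * r[:, None] ** 2).T @ x / n             # average of G G^T
    A_inv = np.linalg.inv(A_n)
    return theta_hat, A_inv @ B_n @ A_inv.T / n

rng = np.random.default_rng(7)
n = 5000
x = np.column_stack([np.ones(n), rng.uniform(size=n)])
# Error standard deviation grows with the covariate (assumed form), so a
# constant-variance model is misspecified
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n) * 3.0 * x[:, 1] ** 2

theta_hat, cov_sandwich = sandwich_linear(x, y)

# Model-based variance that assumes constant error variance, for comparison
resid = y - x @ theta_hat
cov_model = resid @ resid / (n - 2) * np.linalg.inv(x.T @ x)

se_sandwich = np.sqrt(np.diag(cov_sandwich))
se_model = np.sqrt(np.diag(cov_model))
print(theta_hat, se_sandwich, se_model)
```

In this setup the constant-variance standard error for the slope is too small, and the sandwich standard error is noticeably larger.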

We  now  consider  the  situation  in  which  the  estimating  function  arises  from  the 
score  and  suppose  we  have  independent  and  identically  distributed  data.  In  this 
situation 


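$$G_n(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}l_i(\theta)$$

with $l_i(\theta) = \log p(Y_i \mid \theta)$, to give

$$A(\theta) = E\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}l(\theta)\right]$$

and

$$B(\theta) = E\left[\left(\frac{\partial}{\partial\theta}l(\theta)\right)\left(\frac{\partial}{\partial\theta}l(\theta)\right)^T\right]$$

where $l(\theta) = \log p(Y \mid \theta)$. Then, under the model,

$$I_1(\theta) = -A(\theta) = B(\theta),$$

so that

$$\text{var}(\hat\theta_n) = \frac{A^{-1}B(A^T)^{-1}}{n} = \frac{I_1(\theta)^{-1}}{n}.$$

The sandwich estimator (2.43) is based on

$$\hat A = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta^T}l_i(\theta)\Big|_{\hat\theta} \tag{2.44}$$

and

$$\hat B = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial}{\partial\theta}l_i(\theta)\right)\left(\frac{\partial}{\partial\theta}l_i(\theta)\right)^T\Big|_{\hat\theta}.$$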

The  sandwich  method  can  be  applied  to  general  estimating  functions,  not  just  those 
arising  from  a  score  equation  (in  Sect.  2.4.3,  we  considered  the  latter  in  the  context 
of  model  misspecification). 

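Suppose we assume $E[Y_i] = \mu_i$ and $\text{var}(Y_i) = \alpha V(\mu_i)$, and $\text{cov}(Y_i, Y_j) = 0$,
$i, j = 1, \ldots, n$, $i \neq j$, as a working covariance model. Under this specification, it is
natural to take the quasi-score function (2.30) as an estimating function, and in this
case, the variance of the resultant estimator is

$$\text{var}(\hat\beta_n) = (D^T V^{-1}D)^{-1}D^T V^{-1}\,\text{var}(Y)\,V^{-1}D(D^T V^{-1}D)^{-1}.$$

The appropriate variance is obtained by substituting in the correct form for $\text{var}(Y)$.
The latter is, of course, unknown but a simple “sandwich” estimator of the variance
is given by

$$\widehat{\text{var}}(\hat\beta_n) = (D^T V^{-1}D)^{-1}D^T V^{-1}\,\text{diag}(RR^T)\,V^{-1}D(D^T V^{-1}D)^{-1},$$

where $R = [R_1, \ldots, R_n]^T$ is the $n \times 1$ vector of (unstandardized) residuals

$$R_i = Y_i - \mu_i(\hat\beta),$$

so that $\text{diag}(RR^T)$ is the $n \times n$ diagonal matrix with diagonal elements

$$\left[Y_i - \mu_i(\hat\beta)\right]^2$$

for $i = 1, \ldots, n$. This estimator is consistent for the variance of $\hat\beta$, under correct
specification of the mean, and with uncorrelated data. There is finite sample bias in
$R_i$ as an estimate of $Y_i - \mu_i(\beta)$ and versions that adjust for the estimation of the
parameters $\beta$ are available; see Kauermann and Carroll (2001).

Table 2.2 Components of estimation under the assumption of independent outcomes and for
one-dimensional $\beta$

| Component | Likelihood | Quasi-likelihood |
|---|---|---|
| $G(\beta) = \sum_i G_i(\beta)$ | $\sum_i \frac{\partial}{\partial\beta}\log L_i$ | $\frac{1}{\alpha}\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)\frac{Y_i - \mu_i}{V_i}$ |
| $A = \sum_i E\left[\frac{\partial G_i}{\partial\beta}\right]$ | $\sum_i E\left[\frac{\partial^2}{\partial\beta^2}\log L_i\right]$ | $-\frac{1}{\alpha}\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{1}{V_i}$ |
| $\hat A = \sum_i \frac{\partial G_i}{\partial\beta}\Big\vert_{\hat\beta}$ | $\sum_i \frac{\partial^2}{\partial\beta^2}\log L_i\Big\vert_{\hat\beta}$ | $-\frac{1}{\alpha}\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{1}{V_i}\Big\vert_{\hat\beta}$ |
| $B = \sum_i E[G_i(\beta)^2]$ | $\sum_i E\left[\left(\frac{\partial}{\partial\beta}\log L_i\right)^2\right]$ | $\frac{1}{\alpha}\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{1}{V_i}$ |
| $\hat B = \sum_i G_i(\hat\beta)^2$ | $\sum_i \left(\frac{\partial}{\partial\beta}\log L_i\right)^2\Big\vert_{\hat\beta}$ | $\frac{1}{\alpha^2}\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{(Y_i - \hat\mu_i)^2}{V_i^2}$ |
| Model-based variance | $-\left\{\sum_i E\left[\frac{\partial^2}{\partial\beta^2}\log L_i\right]\right\}^{-1}$ | $\alpha\left\{\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{1}{V_i}\right\}^{-1}$ |
| Sandwich variance | $\dfrac{\sum_i \left(\frac{\partial}{\partial\beta}\log L_i\right)^2}{\left(\sum_i \frac{\partial^2}{\partial\beta^2}\log L_i\right)^2}\Big\vert_{\hat\beta}$ | $\dfrac{\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{(Y_i - \hat\mu_i)^2}{V_i^2}}{\left[\sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^2\frac{1}{V_i}\right]^2}$ |

The likelihood model is $p(y \mid \beta) = \prod_i L_i(\beta)$, and the quasi-likelihood model has
$E[Y_i \mid \beta] = \mu_i(\beta)$, $\text{var}(Y_i \mid \beta) = \alpha V_i(\beta)$, $i = 1, \ldots, n$, and $\text{cov}(Y_i, Y_j \mid \beta) = 0$, $i \neq j$. The
expected information is $-\sum_i E\left[\frac{\partial^2}{\partial\beta^2}\log L_i\right]$, and the observed information is $-\sum_i \frac{\partial^2}{\partial\beta^2}\log L_i$.
The sandwich estimator is $A^{-1}BA^{-1}$ which simplifies to $-A^{-1}$ under the model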

The  great  advantage  of  sandwich  estimation  is  that  it  provides  a  consistent 
estimator  of  the  variance  in  very  broad  situations  and  the  use  of  the  empirical 
residuals  is  very  appealing.  There  are  two  things  to  bear  in  mind  when  one  considers 
the  use  of  the  sandwich  technique,  however.  The  first  is  that,  unless  the  sample 
size  is  sufficiently  large,  the  sandwich  estimator  may  be  highly  unstable;  in  terms 
of  mean  squared  error,  model-based  estimators  may  be  preferable  for  small-  to 
medium-sized  n  (for  small  samples  one  would  want  to  avoid  the  reliance  on  the 
asymptotic  distribution  anyway).  Consequently,  empirical  is  a  better  description  of 
the  estimator  than  robust.  The  second  consideration  is  that  if  the  assumed  mean- 
variance  model  is  correct,  then  a  model-based  estimator  is  more  efficient. 

In many cases, quasi-likelihood with a model-based variance estimate may be
viewed as an intermediary between the full model specification and sandwich
estimation, in that the form of the variance function separates estimation of $\beta$ and $\alpha$,
to give consistency of $\hat\beta$ in broad circumstances, though the standard error will not
be consistently estimated unless the variance function is correct. Table 2.2 provides
a summary and comparison of the various elements of the likelihood and quasi-likelihood
methods, with sandwich estimators for each.


Example: Poisson Mean


We report the results of a small simulation study to illustrate the efficiency-
robustness trade-off of variance estimation. Data were simulated from the model
$Y_i \mid \delta_i \sim \text{Poisson}(\delta_i)$, $i = 1, \ldots, n$, where $\delta_i \sim_{iid} \text{Gamma}(\theta b, b)$. This setup gives
marginal moments

$$E[Y_i] = \theta$$

$$\text{var}(Y_i) = E[Y_i] \times \left(1 + \frac{1}{b}\right) = E[Y_i] \times \alpha.$$

We take $\theta = 10$ and $\alpha = 1, 2, 3$ corresponding to no excess-Poisson variability, and
variability that is two and three times the mean. We estimate $\theta$ and then form an
asymptotic confidence interval based on a Poisson likelihood, quasi-likelihood, and
sandwich estimation.

For a univariate estimator $\hat\theta$ arising from a generic estimating function $G(\theta, Y)$:

$$\sqrt{n}\,(\hat\theta - \theta) \to_d N\left(0, \frac{B}{A^2}\right),$$

where

$$A = E\left[\frac{\partial}{\partial\theta}G(\theta, Y)\right], \qquad B = E\left[G(\theta, Y)^2\right].$$

Under the Poisson model

$$l_i(\theta) = -\theta + Y_i\log\theta$$

and

$$G(\theta, Y_i) = S_i(\theta) = \frac{dl_i}{d\theta} = \frac{Y_i - \theta}{\theta}, \qquad \frac{d^2l_i}{d\theta^2} = -\frac{Y_i}{\theta^2},$$

to give the familiar MLE, $\hat\theta = \bar Y$. As we already know

$$I_1(\theta) = -A = -E\left[\frac{d^2l}{d\theta^2}\right] = B = \text{var}\left(\frac{dl}{d\theta}\right) = \frac{\text{var}(Y)}{\theta^2} = \frac{1}{\theta},$$

under the assumption that $\text{var}(Y) = \theta$. The Poisson model-based variance estimator
is therefore

$$\widehat{\text{var}}(\hat\theta) = \frac{1}{nI_1(\hat\theta)} = \frac{\bar Y}{n}.$$

Under the Poisson model, the variance equals the mean, and given the efficiency of
the latter, it makes sense to estimate the variance by the sample mean.

The quasi-likelihood estimator is derived from the quasi-score

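$$G(\theta, Y_i) = U_i(\theta) = \frac{Y_i - \theta}{\alpha\theta},$$

and

$$\text{var}(\hat\theta) = \frac{\alpha\theta}{n}$$

where the scale parameter is estimated using the method of moments

$$\hat\alpha = \frac{1}{n - 1}\sum_{i=1}^n \frac{(Y_i - \hat\theta)^2}{\hat\theta}.$$

The quasi-likelihood estimator of the variance is

$$\widehat{\text{var}}(\hat\theta) = \frac{s^2}{n}$$

where

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^n (Y_i - \bar Y)^2.$$

For sandwich estimation based on the score

$$\hat A = -\frac{1}{n}\sum_{i=1}^n \frac{Y_i}{\hat\theta^2} = -\frac{1}{\bar Y}$$

and

$$\hat B = \frac{1}{n}\sum_{i=1}^n \frac{(Y_i - \hat\theta)^2}{\hat\theta^2} = \frac{(n - 1)s^2}{n\bar Y^2}.$$

Hence,

$$\widehat{\text{var}}(\hat\theta) = \frac{s^2(n - 1)/n}{n}. \tag{2.45}$$

Estimation of $\text{var}(Y_i)$ by $(Y_i - \bar Y)^2$ produces the variance estimator (2.45).
Estimating $\text{var}(Y_i)$ by $n(Y_i - \bar Y)^2/(n - 1)$ would reproduce the degrees of freedom
adjusted quasi-likelihood estimator.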
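
The three variance estimators derived above are simple enough to compute directly. The sketch below does so for a small made-up sample (the data values and the function name `poisson_mean_variances` are assumptions) and checks the closed forms against the generic expression $\hat B/(\hat A^2 n)$.

```python
import numpy as np

def poisson_mean_variances(y):
    """Return theta_hat and the model-based, quasi-likelihood, and sandwich variances."""
    n = len(y)
    theta_hat = y.mean()
    s2 = y.var(ddof=1)                        # s^2 with divisor n - 1
    model = theta_hat / n                     # Poisson model-based: Ybar / n
    quasi = s2 / n                            # quasi-likelihood: alpha_hat * theta_hat / n
    sandwich = s2 * (n - 1) / n / n           # sandwich, equation (2.45)
    return theta_hat, model, quasi, sandwich

y = np.array([7.0, 12.0, 9.0, 15.0, 8.0, 11.0, 20.0, 6.0, 10.0, 13.0])   # illustrative counts
n = len(y)
theta_hat, model, quasi, sandwich = poisson_mean_variances(y)

# Generic sandwich ingredients for the Poisson score G = (Y - theta) / theta
A_hat = -np.mean(y) / theta_hat ** 2
B_hat = np.mean((y - theta_hat) ** 2) / theta_hat ** 2
sandwich_generic = B_hat / A_hat ** 2 / n

# Method of moments scale estimate used by quasi-likelihood
alpha_hat = np.sum((y - theta_hat) ** 2 / theta_hat) / (n - 1)

print(theta_hat, model, quasi, sandwich, sandwich_generic, alpha_hat)
```

The quasi-likelihood and sandwich estimators differ only by the factor $(n - 1)/n$, which is why their behavior is similar except in very small samples.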

Table 2.3 gives the 95% confidence interval coverage for the model-based, quasi-likelihood,
and sandwich estimator variance estimates as a function of the sample
size $n$ and overdispersion/scalar parameter $\alpha$. We see that when the Poisson model
is correct ($\alpha = 1$), the model-based standard errors produce accurate coverage for
all values of $n$. For small $n$, the quasi-likelihood and sandwich estimators have low
coverage, due to the instability in variance estimation, with sandwich estimation
being slightly poorer in performance. As the level of overdispersion increases, the
performance of the model-based approach starts to deteriorate as the standard error
is underestimated, resulting in low coverage. For $\alpha = 2, 3$, the quasi-likelihood and
sandwich estimators again give low coverage for small values of $n$, due to instability,
but for larger values, the coverage quickly improves. The adjusted degrees of
freedom used by quasi-likelihood give slightly improved estimation over the naive
sandwich estimator.

Table 2.3 Percent confidence interval coverage for the Poisson mean example, based on 100,000
simulations

| $n$ | $\alpha = 1$: Model | Quasi | Sand | $\alpha = 2$: Model | Quasi | Sand | $\alpha = 3$: Model | Quasi | Sand |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 95 | 87 | 84 | 83 | 87 | 84 | 74 | 86 | 83 |
| 10 | 94 | 92 | 90 | 83 | 91 | 90 | 73 | 91 | 89 |
| 15 | 95 | | | | | | | | |

The nominal coverage is 95%. The overdispersion is given by $\alpha = \text{var}(Y)/E[Y]$

This example shows the efficiency-robustness trade-off. If the model is correct
(which corresponds here to $\alpha = 1$), then the model-based approach performs
well.  The  sandwich  and  quasi-likelihood  approaches  are  more  robust  to  variance 
misspecification,  but  can  be  unstable  when  the  sample  size  is  small.  The  choice  of 
which  variance  model  to  use  depends  crucially  on  our  faith  in  the  model.  The  use  of 
a  Poisson  model  is  a  risky  enterprise,  however,  since  it  does  not  contain  an  additional 
variance  parameter. 


Example:  Lung  Cancer  and  Radon 

Returning  to  the  lung  cancer  and  radon  example,  we  calculate  sandwich  standard 
errors,  assuming  that  counts  in  different  areas  are  uncorrelated.  We  take  as  “working 
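model” a Poisson likelihood, with maximum likelihood estimation of $\beta$. The estimating function is

$$S(\beta) = D^T V^{-1}(Y - \mu) = x^T(Y - \mu),$$

as derived previously, (2.20). Under this model

$$(A^{-1} B A^{-T})^{-1/2}(\hat\beta_n - \beta) \to_d N_2(0, I_2),$$

with sandwich ingredients

$$A = D^T V^{-1} D$$

$$B = D^T V^{-1}\, \mathrm{var}(Y)\, V^{-1} D,$$

and estimators

$$\hat{A} = \hat{D}^T \hat{V}^{-1} \hat{D}$$

$$\hat{B} = \hat{D}^T \hat{V}^{-1}
\begin{bmatrix}
\hat\sigma_1^2 & 0 & \cdots & 0 \\
0 & \hat\sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat\sigma_n^2
\end{bmatrix}
\hat{V}^{-1} \hat{D},$$

and with $\hat\sigma_i^2 = (Y_i - \hat\mu_i)^2$, for $i = 1, \dots, n$. Substitution of the required data quantities yields the variance-covariance matrix

$$\begin{bmatrix}
0.043^2 & -0.87 \times 0.043 \times 0.0080 \\
-0.87 \times 0.043 \times 0.0080 & 0.0080^2
\end{bmatrix}.$$

The estimated standard errors of $\hat\beta_0$ and $\hat\beta_1$ are 0.043 and 0.0080, respectively, and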
are  60%  and  49%  larger  than  their  likelihood  counterparts,  though  slightly  smaller 
than  the  quasi-likelihood  versions.  An  asymptotic  95%  confidence  interval  for  the 
relative  risk  associated  with  a  one-unit  increase  in  radon  is  [0.949,  0.980]. 

We  have  a  linear  exponential  family  likelihood  and  so  a  consistent  estimator 
of  the  loglinear  association  between  lung  cancer  incidence  and  radon,  as  is  clear 
from  (2.20).  If  the  outcomes  are  independent,  then  a  consistent  sandwich  variance 
estimator  is  obtained  and  the  large  sample  size  indicates  asymptotic  inference  is 
appropriate.  However,  in  the  context  of  these  data,  independence  is  a  little  dubious  as 
we  may  have  residual  spatial  dependence,  particularly  since  we  have  not  controlled 
for  confounders  such  as  smoking  which  may  have  spatial  structure  (and  hence 
will  induce  spatial  dependence).  Sandwich  standard  errors  do  not  account  for  such 
dependence  (unless  we  can  lean  on  replication  across  time).  In  Sect.  9.7,  we  describe 
a  model  that  allows  for  residual  spatial  dependence  in  the  counts.  Although  the 
loglinear  association  is  consistently  estimated,  this  of  course  says  nothing  about 
causality  or  about  the  appropriateness  of  the  mean  model. 


2.7  Bootstrap  Methods 

With  respect  to  estimation  and  hypothesis  testing,  the  fundamental  frequentist 
inferential  summary  is  the  distribution  of  an  estimator  under  hypothetical  repeated 
sampling  from  the  distribution  of  the  data.  So  far  we  have  concentrated  on  the  use 
of  the  asymptotic  distribution  of  the  estimator  under  an  assumed  model,  though 
sandwich  estimation  (and  to  a  lesser  extent  quasi-likelihood)  provided  one  method 
by  which  we  could  relax  the  reliance  on  the  assumed  model.  The  bootstrap  is  a 
computational  technique  for  alleviating  some  forms  of  model  misspecification.  The 
bootstrap  may  also  be  used,  to  some  extent,  to  account  for  a  “non-asymptotic” 
sample  size.  We  first  describe  its  use  in  single  parameter  settings  before  moving 
to a regression context.

2.7.1 The Bootstrap for a Univariate Parameter


Suppose $Y_1, \dots, Y_n$ are an independent and identically distributed sample from a distribution function $F$ that depends on a univariate parameter $\theta$. Let $\hat\theta(Y)$ represent an estimator of $\theta$. We may be interested in estimation of

(i) $\mathrm{var}_F[\hat\theta(Y)]$

(ii) $\Pr_F[a < \hat\theta(Y) < b]$

where we have emphasized that these summaries are evaluated under the sampling distribution of the data $F$. Estimation of (i) is of particular interest if the sampling distribution of $\hat\theta$ is approximately normal, in which case a $100(1 - \alpha)\%$ confidence interval is

$$\hat\theta(Y) - \mathrm{bias}_F[\hat\theta(Y)] \pm z_{1-\alpha/2} \sqrt{\mathrm{var}_F[\hat\theta(Y)]} \qquad (2.46)$$

where $\mathrm{bias}_F[\hat\theta(Y)] = \mathrm{E}_F[\hat\theta(Y)] - \theta$ is the bias of the estimator, and $z_{1-\alpha/2}$ the $(1 - \alpha/2)$ quantile of an N(0, 1) random variable. More generally, interest may focus on a function of interest $T(F)$.

The bootstrap is an idea that is so simple it seems, at first sight, like cheating, but it turns out to be statistically valid in many circumstances, so long as care is taken in its implementation. The idea is to first draw $B$ bootstrap samples of size $n$, $Y_b^* = [Y_{b1}^*, \dots, Y_{bn}^*]$, $b = 1, \dots, B$, from an estimate of $F$, $\hat{F}$. In the nonparametric bootstrap, the estimate of $F$ is $\hat{F}_n$, the empirical estimate of the distribution function that places a mass of $1/n$ at each of the observed $Y_i$, $i = 1, \dots, n$. Bootstrap samples are obtained by sampling a new dataset $Y_{bi}^*$, $i = 1, \dots, n$, from $\hat{F}_n$, with replacement. If one has some faith in the assumed model, then $\hat{F}$ may be based upon this model, which we call $F_{\hat\theta}$ where $\hat\theta = \hat\theta(y)$, to give a second implementation. In this case, bootstrap samples are obtained by sampling $Y_{bi}^*$, $i = 1, \dots, n$, as independent and identically distributed samples from $F_{\hat\theta}$, to give a parametric bootstrap estimator.

Intuitively, we are replacing the distribution of

$$\hat\theta_n - \theta$$

with

$$\hat\theta_n^* - \hat\theta_n.$$


Much  theory  is  available  to  support  the  use  of  the  bootstrap;  early  references  are 
Bickel  and  Freedman  (1981)  and  Singh  (1981);  see  also  van  der  Vaart  (1998). 
Further  references  to  the  bootstrap  are  given  in  Sect.  2.11.  As  a  simple  example 
of  the  sort  of  results  that  are  available,  we  quote  the  following,  a  proof  of  which 
may be found in Bickel and Freedman (1981).

Result. Consider a bootstrap estimator of the mean, $\mu$, of the distribution $F$ and assume $\mathrm{E}[Y^2] < \infty$ and let the variance of $F$ be $\sigma^2$. Then we know that $\sqrt{n}(\bar{Y}_n - \mu) \to_d N(0, \sigma^2)$, and for almost every sequence

$$\sqrt{n}(\bar{Y}_n^* - \bar{Y}_n) \to_d N(0, \sigma^2).$$

The distribution of other functions of interest can be obtained via the delta method; see van der Vaart (1998). There are two approximations that are being used in the bootstrap. First, we are estimating $F$ by $\hat{F}$, and second, we are estimating the quantity of interest, for example, (i) or (ii), using $B$ samples from $\hat{F}$. For example, if (i) is of interest, an obvious estimator of $\mathrm{var}_{\hat{F}}(\hat\theta)$ is

$$\widehat{\mathrm{var}}_{\hat{F}}(\hat\theta) = \frac{1}{B} \sum_{b=1}^B \left[ \hat\theta(Y_b^*) - \frac{1}{B} \sum_{b'=1}^B \hat\theta(Y_{b'}^*) \right]^2. \qquad (2.47)$$

In this case, the two approximations are

$$\mathrm{var}_F(\hat\theta) \approx \mathrm{var}_{\hat{F}}(\hat\theta) \approx \widehat{\mathrm{var}}_{\hat{F}}(\hat\theta),$$

and the first approximation may be poor if the estimate $\hat{F}$ is not close to $F$, but we can control the second approximation by choosing large $B$. For the nonparametric bootstrap, we could, in principle, enumerate all possible samples, but there are $n^n$ of these, of which $\binom{2n-1}{n}$ are distinct, which is far too large a number to evaluate in practice.
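The Monte Carlo estimator (2.47) can be sketched for both choices of $\hat{F}$. This is a minimal illustration under stated assumptions rather than code from the text: the counts are made up, the parametric version assumes a Poisson model with $\hat\theta = \bar{y}$, and `rpois` and `bootstrap_variance` are hypothetical helper names.

```python
import math
import random
from statistics import mean


def rpois(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def bootstrap_variance(y, estimator, B=2000, parametric=False, seed=1):
    """Monte Carlo estimate of var(estimator) with divisor B, as in (2.47)."""
    rng = random.Random(seed)
    n = len(y)
    theta_hat = mean(y)  # Poisson MLE; only used by the parametric version
    estimates = []
    for _ in range(B):
        if parametric:
            # sample from the fitted model (assumed Poisson here)
            y_star = [rpois(theta_hat, rng) for _ in range(n)]
        else:
            # sample with replacement from the empirical distribution
            y_star = rng.choices(y, k=n)
        estimates.append(estimator(y_star))
    centre = sum(estimates) / B
    return sum((t - centre) ** 2 for t in estimates) / B


y = [0, 1, 1, 2, 2, 3, 3, 4, 6, 8]  # made-up counts
var_nonparametric = bootstrap_variance(y, mean)
var_parametric = bootstrap_variance(y, mean, parametric=True)
print(var_nonparametric, var_parametric)
```

For the sample mean the targets are available in closed form ($\sum_i (y_i - \bar{y})^2/n^2 = 0.54$ for the nonparametric version and $\bar{y}/n = 0.3$ for the Poisson version), so the printed values indicate only the size of the second, Monte Carlo, approximation.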

There are many possibilities for computation of confidence limits, as required in (ii). If normality of $\hat\theta$ is reasonable, then (2.46) is straightforward to use with the variance estimated by (2.47) and the bias by

$$\widehat{\mathrm{bias}}_{\hat{F}}[\hat\theta(y)] = \frac{1}{B} \sum_{b=1}^B \hat\theta(Y_b^*) - \hat\theta(y).$$

As a simple alternative, the bootstrap percentile interval for a confidence interval of coverage $1 - \alpha$ is

$$[\hat\theta^*_{\alpha/2},\ \hat\theta^*_{1-\alpha/2}],$$

where $\hat\theta^*_{\alpha/2}$ and $\hat\theta^*_{1-\alpha/2}$ are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap estimates $\hat\theta(Y_b^*)$, $b = 1, \dots, B$. More refined bootstrap confidence interval procedures are described in Davison and Hinkley (1997). For example, Exercise 2.9 outlines the derivation of a confidence interval based on a pivot. In Sect. 2.7.3, we illustrate the close links between bootstrap variance estimation and sandwich estimation.
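A short sketch of the two interval constructions just described, written as an illustration only. It assumes a nonparametric bootstrap, centres the normal interval at the estimate minus the bootstrap bias estimate (mean of the bootstrap estimates minus $\hat\theta$), and takes the percentile limits as simple order statistics; the function name `bootstrap_intervals` and the data are hypothetical.

```python
import random
from statistics import NormalDist, mean


def bootstrap_intervals(y, estimator, B=4000, level=0.95, seed=2):
    """Return (normal interval, percentile interval) from a nonparametric bootstrap."""
    rng = random.Random(seed)
    n = len(y)
    theta_hat = estimator(y)
    boot = sorted(estimator(rng.choices(y, k=n)) for _ in range(B))
    centre = sum(boot) / B
    variance = sum((t - centre) ** 2 for t in boot) / B   # bootstrap variance, divisor B
    bias = centre - theta_hat                             # bootstrap bias estimate
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * variance ** 0.5
    normal = (theta_hat - bias - half_width, theta_hat - bias + half_width)
    # Percentile interval: order statistics of the sorted bootstrap estimates.
    lower = boot[int(B * (1 - level) / 2)]
    upper = boot[int(B * (1 + level) / 2) - 1]
    return normal, (lower, upper)


y = [0, 1, 1, 2, 2, 3, 3, 4, 6, 8]  # made-up counts
normal, percentile = bootstrap_intervals(y, mean)
print(normal, percentile)
```

Other quantile conventions (for example, interpolating between order statistics) are equally defensible; the choice matters little once $B$ is large.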

The bootstrap method does not work for all functions of interest. In particular, it fails in situations when the tail behavior is not well behaved; for example, a bootstrap for the maximum $Y_{(n)}$ will be disastrous.

2.7.2 The Bootstrap for Regression

The parametric and nonparametric methods provide two distinct versions of the bootstrap, and in a regression context, another important distinction is between resampling residuals and resampling cases. We illustrate the difference by considering the model

$$Y_i = f(x_i, \beta) + \epsilon_i, \qquad (2.48)$$

where the residuals $\epsilon_i$ are such that $\mathrm{E}[\epsilon_i] = 0$, $i = 1, \dots, n$, and are assumed uncorrelated. The two methods are characterized according to whether we take $F$ to be the distribution of $Y$ only or of $\{Y, X\}$. In the resampling residuals approach, the covariates $x_i$ are considered as fixed, and bootstrap datasets are formed as

$$Y_{bi}^* = f(x_i, \hat\beta) + \epsilon_{bi}^*,$$

where a number of options are available for sampling $\epsilon_{bi}^*$, $b = 1, \dots, B$, $i = 1, \dots, n$. The simplest, nonparametric, version is to sample $\epsilon_{bi}^*$ with replacement from the residuals of the fitted model.

Various refinements of this simple approach are possible. If we are willing to assume (say) that $\epsilon_i \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$, then a parametric resampling residuals method samples $\epsilon_{bi}^* \sim N(0, \hat\sigma^2)$ based on an estimate $\hat\sigma^2$. In a model such as (2.48), the meaning of residuals is clear, but in generalized linear models (Chap. 6), for example, this is not the case and many alternative definitions exist.

The resampling residuals method has the advantage of respecting the “design,” that is, $x_1, \dots, x_n$. A major disadvantage, however, is that we are leaning heavily on the assumed mean-variance relationship, and we would often prefer to protect ourselves against an assumed model. The resampling case method forms bootstrap datasets by sampling with replacement from $\{Y_i, X_i,\ i = 1, \dots, n\}$ and does not assume a mean-variance model. Again parametric and nonparametric versions are
available,  but  the  latter  is  preferred  since  the  former  requires  a  model  for  the  joint 
distribution  of  the  response  and  covariates  which  is  likely  to  be  difficult  to  specify. 
When  cases  are  resampled,  the  design  in  each  bootstrap  sample  will  not  in  general 
correspond  to  that  in  the  original  dataset  which,  though  not  ideal  (since  it  leads 
to  wider  confidence  intervals  than  necessary),  will  have  little  impact  on  inference, 
except  when  there  are  outliers  in  the  data;  if  the  outliers  are  sampled  multiple  times, 
then  instability  may  result. 
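The difference between the two resampling schemes can be seen in a short sketch for a straight-line fit. This is an illustration under stated assumptions, not the book's implementation: the mean function is taken to be linear, the residual version resamples the raw least squares residuals (which sum to zero when an intercept is fitted), the data are made up, and `ols` and `bootstrap_slope_se` are hypothetical names.

```python
import random


def ols(x, y):
    """Least squares (intercept, slope) for a straight-line fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def bootstrap_slope_se(x, y, method="cases", B=1000, seed=3):
    """Bootstrap standard error of the slope by resampling cases or residuals."""
    rng = random.Random(seed)
    n = len(x)
    b0, b1 = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    while len(slopes) < B:
        if method == "residuals":
            # design held fixed; errors drawn from the fitted residuals
            x_star = x
            y_star = [fi + e for fi, e in zip(fitted, rng.choices(residuals, k=n))]
        else:
            # resample (x_i, y_i) pairs, so the design changes between samples
            idx = rng.choices(range(n), k=n)
            x_star = [x[i] for i in idx]
            y_star = [y[i] for i in idx]
            if len(set(x_star)) < 2:
                continue  # degenerate design: slope undefined, draw again
        slopes.append(ols(x_star, y_star)[1])
    centre = sum(slopes) / B
    return (sum((s - centre) ** 2 for s in slopes) / B) ** 0.5


x = list(range(1, 13))
y = [1.2, 1.9, 3.4, 3.8, 5.5, 5.7, 7.9, 7.4, 9.8, 9.1, 12.6, 10.9]  # made-up data
se_cases = bootstrap_slope_se(x, y, method="cases")
se_residuals = bootstrap_slope_se(x, y, method="residuals")
print(se_cases, se_residuals)
```

In these made-up data the spread of the responses grows with $x$, so the residual version, which treats the errors as exchangeable across design points, relies on an assumption that the case version avoids.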


2.7.3 Sandwich Estimation and the Bootstrap


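In this section we heuristically show why we would often expect sandwich and bootstrap variance estimates to be in close correspondence. For simplicity, we consider a univariate parameter $\theta$, and let $\hat\theta_n$ denote the MLE arising from a sample of size $n$. In a change of notation, we denote the score by $\boldsymbol{S}(\theta) = [S_1(\theta), \dots, S_n(\theta)]^T$, where $S_i(\theta) = dl_i/d\theta$ is the contribution to the score from observation $Y_i$, $i = 1, \dots, n$. Hence,

$$S(\theta) = \sum_{i=1}^n S_i(\theta) = \boldsymbol{S}(\theta)^T \boldsymbol{1},$$

where $\boldsymbol{1}$ is an $n \times 1$ vector of 1's. The sandwich form of the asymptotic variance of $\hat\theta_n$ is

$$\mathrm{var}(\hat\theta_n) = \frac{1}{n} \frac{B}{A^2},$$

where

$$A(\theta) = \mathrm{E}\left[\frac{dS_i}{d\theta}\right], \qquad B(\theta) = \mathrm{E}\left[S_i(\theta)^2\right].$$

These quantities may be empirically estimated via

$$\hat{A}_n = \frac{1}{n} \left.\frac{dS}{d\theta}\right|_{\hat\theta_n} = \frac{1}{n} \sum_{i=1}^n \left.\frac{dS_i}{d\theta}\right|_{\hat\theta_n},$$

$$\hat{B}_n = \frac{1}{n} \boldsymbol{S}(\hat\theta_n)^T \boldsymbol{S}(\hat\theta_n) = \frac{1}{n} \sum_{i=1}^n S_i(\hat\theta_n)^2.$$

A convenient representation of a bootstrap sample is $Y^* = Y \times D$ where $D = \mathrm{diag}(D_1, \dots, D_n)$ is a diagonal matrix consisting of a multinomial random variable

$$[D_1, \dots, D_n]^T \sim \mathrm{Multinomial}\left(n, \left(\tfrac{1}{n}, \dots, \tfrac{1}{n}\right)\right)$$

with

$$\mathrm{E}([D_1, \dots, D_n]^T) = \boldsymbol{1},$$

$$\mathrm{var}([D_1, \dots, D_n]^T) = I_n - \frac{1}{n} \boldsymbol{1}\boldsymbol{1}^T \to I_n$$

as $n \to \infty$. The MLE of $\theta$ in the bootstrap sample is denoted $\hat\theta_n^*$ and satisfies $S^*(\hat\theta_n^*) = 0$, where $S^*(\theta)$ is the score corresponding to $Y^*$. Note that

$$S^*(\theta) = \sum_{i=1}^n S_i^*(\theta) = \sum_{i=1}^n S_i(\theta) D_i.$$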



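We consider a one-step Newton-Raphson approximation (see Sect. 6.5.2 for a more detailed description of this method) to $\hat\theta_n^*$ and show that this leads to a bootstrap variance estimate that is approximately equal to the sandwich variance estimate. The following informal derivation is carried out without stating regularity conditions. It is important to emphasize that throughout we are conditioning on $Y$ and therefore on $\hat\theta_n$. A first-order Taylor series approximation

$$0 = S^*(\hat\theta_n^*) \approx S^*(\hat\theta_n) + (\hat\theta_n^* - \hat\theta_n) \left.\frac{dS^*}{d\theta}\right|_{\hat\theta_n}$$

leads to the one-step approximation

$$\hat\theta_n^* \approx \hat\theta_n - \frac{S^*(\hat\theta_n)}{\left.\frac{dS^*}{d\theta}\right|_{\hat\theta_n}}.$$

The bootstrap score evaluated at $\hat\theta_n$ is

$$S^*(\hat\theta_n) = \sum_{i=1}^n S_i^*(\hat\theta_n) = \sum_{i=1}^n S_i(\hat\theta_n) D_i \neq 0,$$

unless the bootstrap sample coincides with the original sample, that is, unless $D = I_n$. We replace $\left.\frac{d}{d\theta} S^*(\theta)\right|_{\hat\theta_n}$ by its limit

$$\mathrm{E}\left[\left.\frac{dS^*}{d\theta}\right|_{\hat\theta_n}\right] = \mathrm{E}\left[\sum_{i=1}^n \left.\frac{dS_i}{d\theta}\right|_{\hat\theta_n} D_i\right] = \sum_{i=1}^n \left.\frac{dS_i}{d\theta}\right|_{\hat\theta_n} \mathrm{E}[D_i] = n \times \hat{A}_n,$$

where $\hat{A}_n = \frac{1}{n} \left.\frac{d}{d\theta} S(\theta)\right|_{\hat\theta_n}$. Therefore, the one-step bootstrap estimator is approximated by

$$\hat\theta_n^* \approx \hat\theta_n - \frac{\boldsymbol{S}(\hat\theta_n)^T [D_1, \dots, D_n]^T}{n\hat{A}_n}$$

and is approximately unbiased as an estimator since

$$\mathrm{E}[\hat\theta_n^* - \hat\theta_n] \approx -\frac{\boldsymbol{S}(\hat\theta_n)^T \mathrm{E}([D_1, \dots, D_n]^T)}{n\hat{A}_n} = -\frac{\boldsymbol{S}(\hat\theta_n)^T \boldsymbol{1}}{n\hat{A}_n} = 0$$

and, recall, $\hat\theta_n$ is being held constant. The variance is

$$\mathrm{var}(\hat\theta_n^* - \hat\theta_n) \approx \frac{\boldsymbol{S}(\hat\theta_n)^T \mathrm{var}([D_1, \dots, D_n]^T) \boldsymbol{S}(\hat\theta_n)}{(n\hat{A}_n)^2} = \frac{\boldsymbol{S}(\hat\theta_n)^T \left(I_n - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T\right) \boldsymbol{S}(\hat\theta_n)}{(n\hat{A}_n)^2}$$

$$\approx \frac{\boldsymbol{S}(\hat\theta_n)^T I_n \boldsymbol{S}(\hat\theta_n)}{(n\hat{A}_n)^2} = \frac{n\hat{B}_n}{(n\hat{A}_n)^2} = \frac{\hat{B}_n}{n\hat{A}_n^2},$$

which is the sandwich estimator. Hence, $\mathrm{var}(\hat\theta_n^* - \hat\theta_n)$ approximates $\mathrm{var}(\hat\theta_n - \theta)$, which is a fundamental link in the bootstrap. For a more theoretical treatment, see Arcones and Gine (1992) and Sect. 10.3 of Kosorok (2008).

Fig. 2.3 Sampling distribution of $\hat\beta_1$ arising from the nonparametric bootstrap samples. The solid curve is the asymptotic distribution of the MLE under the Poisson model, and the dashed line is the asymptotic distribution under the quasi-Poisson model (horizontal axis: $\hat\beta_1$)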
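The one-step argument of Sect. 2.7.3 can be checked numerically without any simulation, because the multinomial covariance $I_n - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T$ is known exactly. The sketch below is an illustration only: it assumes the Poisson mean working model used earlier in the chapter, with score contributions $S_i(\theta) = (y_i - \theta)/\theta$, and the function name and data are hypothetical.

```python
def sandwich_and_one_step(y):
    """Compare B_n / (n A_n^2) with the one-step bootstrap variance."""
    n = len(y)
    theta = sum(y) / n                            # MLE under the Poisson working model
    S = [(yi - theta) / theta for yi in y]        # score contributions at the MLE
    A = sum(-yi / theta ** 2 for yi in y) / n     # A_n = (1/n) sum dS_i/dtheta
    Bn = sum(s * s for s in S) / n                # B_n = (1/n) sum S_i^2
    sandwich = Bn / (n * A ** 2)
    # Quadratic form S^T (I - 11^T/n) S with the exact multinomial covariance.
    quad = sum(
        S[i] * ((1.0 if i == j else 0.0) - 1.0 / n) * S[j]
        for i in range(n)
        for j in range(n)
    )
    one_step = quad / (n * A) ** 2
    return sandwich, one_step


sandwich, one_step = sandwich_and_one_step([1, 2, 3, 4, 5])  # made-up counts
print(sandwich, one_step)
```

Because the score contributions sum to zero at the MLE, the $\frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T$ term contributes nothing here, and the two quantities agree up to rounding; both equal the sandwich estimate (2.45) for these data.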


Example:  Lung  Cancer  and  Radon 

For the lung cancer and radon example, we implement the nonparametric bootstrap resampling $B = 1{,}000$ sets of $n$ case triples $[Y_{bi}^*, E_{bi}^*, x_{bi}^*]$, $b = 1, \dots, B$, $i = 1, \dots, n$. Figure 2.3 displays the histogram of estimates arising from the bootstrap samples, along with the asymptotic normal approximations to the sampling distribution of the estimator under the Poisson and quasi-Poisson models. We see that the distribution under the quasi-likelihood model is much wider than that under the Poisson model. This is not surprising since we have already seen that the lung cancer data are overdispersed relative to a Poisson distribution. The bootstrap histogram and quasi-Poisson sampling distribution are very similar, however.

Table 2.4 summarizes inference for $\beta_1$ under a number of different methods and again confirms the similarity of asymptotic inference under the quasi-Poisson model and nonparametric bootstrap. In this example the similarity in the intervals from quasi-likelihood, sandwich estimation, and the nonparametric bootstrap is reassuring. The point estimates from the Poisson, quasi-likelihood, and sandwich approaches are identical. The point estimate from the quadratic variance model (that arises from a negative binomial model) is slightly closer to zero for these data, due to the difference in the variance models over the large range of counts in these data.

Table 2.4 Comparison of inferential summaries over various approaches, for the lung cancer and radon example

Inferential method      Estimate of β1    s.e.(β1)    95% CI for exp(β1)
Poisson                 -0.036            0.0054      0.954, 0.975
Quasi-likelihood        -0.036            0.0090      0.947, 0.982
Quadratic variance      -0.030            0.0085      0.955, 0.987
Sandwich estimation     -0.036            0.0080      0.949, 0.980
Bootstrap normal        -0.036            0.0087      0.948, 0.981
Bootstrap percentile    -0.036            0.0087      0.949, 0.981

The last two lines refer to nonparametric bootstrap approaches, with intervals based on normality of the sampling distribution of the estimator (“Normal”) and on taking the 2.5% and 97.5% points of this distribution (“Percentile”)


2.8  Choice  of  Estimating  Function 

The  choice  of  estimating  function  is  driven  by  the  conflicting  aims  of  efficiency 
and  robustness  to  model  misspecification.  If  the  likelihood  corresponds  to  the 
true  model,  then  MLEs  are  asymptotically  efficient  so  that  asymptotic  confidence 
intervals  have  minimum  length.  However,  if  the  assumed  model  is  incorrect,  then 
there  are  no  guarantees  of  even  consistency  of  estimation. 

Basing  estimating  functions  on  simple  model-free  functions  of  the  data  often 
provides  robustness.  As  we  discuss  in  Sect.  5.6.3,  the  classic  Gauss-Markov 
theorem  states,  informally,  that  among  estimators  that  are  linear  in  the  data,  the 
least  squares  estimator  has  smallest  variance,  and  this  result  is  true  for  fixed  sample 
sizes.  There  is  also  a  Gauss-Markov  theorem  for  estimating  functions.  Suppose 
$\mathrm{E}[Y_i \mid \beta] = \mu_i(\beta)$, $\mathrm{var}(Y_i) = \sigma_i^2$ and $\mathrm{cov}(Y_i, Y_j) = 0$, $i \neq j$, and consider the class of linear unbiased estimating functions (of zero) that are of the form

$$G(\beta) = \sum_{i=1}^n a_i(\beta)\,[Y_i - \mu_i(\beta)], \qquad (2.49)$$

where $a_i(\beta)$ are specified nonrandom functions, subject to $\sum_{i=1}^n a_i(\beta) = c$, a constant (this is to avoid obtaining an arbitrarily small variance by multiplying the estimating function by a constant). The estimating function (2.49) provides a consistent estimator $\hat\beta$ so long as the mean $\mu_i(\beta)$ is correctly specified. It can be shown, for example, Godambe and Heyde (1987), that

$$\mathrm{E}[UU^T] \leq \mathrm{E}[GG^T], \qquad (2.50)$$

where

$$U(\beta) = D^T V^{-1}(Y - \mu)/\alpha,$$

so that this estimating function has the smallest variance. Quasi-likelihood estimators are therefore asymptotically optimal in the class of linear estimating functions
and will be asymptotically efficient if the quasi-score functions correspond to the
score  of  the  likelihood  of  the  true  data-generating  model.  Of  course  a  superior 
estimator  (in  terms  of  efficiency)  may  result  from  an  estimating  function  that  is  not 
linear  in  the  data,  if  the  data  arise  from  a  model  for  which  the  score  function  is  not 
linear.  The  consideration  of  quadratic  estimating  functions  illustrates  the  efficiency- 
robustness  trade-off. 

Result  (2.50)  is  true  for  an  estimating  function  based  on  a  finite  sample  size  n , 
though  there  is  no  such  result  for  the  derived  estimator.  However,  the  estimator 
derived  from  the  estimating  function  is  asymptotically  efficient  (e.g.,  McCullagh 
1983).  The  optimal  estimating  equation  is  that  which  has  minimum  expected 
distance  from  the  score  equation  corresponding  to  the  true  model.  We  reemphasize 
that  a  consistent  estimator  of  the  parameters  in  the  assumed  regression  model  is 
obtained  from  the  quasi-score  (2.50),  and  the  variance  of  the  estimator  will  be 
appropriate  so  long  as  the  second  moment  of  the  data  has  been  specified  correctly. 

To motivate the class of quadratic estimating functions suppose

$$Y_i \mid \beta \sim_{ind} N[\mu_i(\beta), \sigma_i^2(\beta)],$$

$i = 1, \dots, n$. The log-likelihood is

$$l(\beta) = -\sum_{i=1}^n \log \sigma_i(\beta) - \frac{1}{2} \sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]^2}{\sigma_i^2(\beta)},$$

which gives the quadratic score equations

$$S(\beta) = \frac{\partial l}{\partial \beta} = \sum_{i=1}^n \frac{Y_i - \mu_i(\beta)}{\sigma_i^2(\beta)} \frac{\partial \mu_i}{\partial \beta} + \sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]^2 - \sigma_i^2(\beta)}{\sigma_i^3(\beta)} \frac{\partial \sigma_i}{\partial \beta}. \qquad (2.51)$$

If the first two moments are correctly specified, then $\mathrm{E}[S(\beta)] = 0$, so that a consistent estimator is obtained.

In general, we may consider

$$\sum_{i=1}^n a_i(\beta)\,[Y_i - \mu_i(\beta)] + b_i(\beta)\,\{[Y_i - \mu_i(\beta)]^2 - \sigma_i^2(\beta)\},$$

where $a_i(\beta)$, $b_i(\beta)$ are specified nonrandom functions. With this estimating function, the information in the variance concerning the parameters $\beta$ is being used to improve efficiency. Among quadratic estimating functions, it can be shown that (2.51) is optimal in the sense of producing estimators that are asymptotically efficient (Crowder 1987). In general, to choose the optimal estimating function, the first four moments of the data must be known, which may seem unlikely, but this approach may be contrasted with the use of the score as estimating function which effectively requires all of the moments to be known. There are two problems with using quadratic estimating functions. First, consistency requires the first two moments to be correctly specified. Second, to estimate the covariance matrix of the estimator, the skewness and kurtosis must be estimated, and these may be highly unstable. We return to this topic in Sect. 9.10.


2.9  Hypothesis  Testing 

Throughout  the  book,  we  emphasize  estimation  over  hypothesis  testing,  for  reasons 
discussed  in  Chap.  4,  but  in  this  section  describe  the  rationale  and  machinery  of 
frequentist  hypothesis  testing. 


2.9.1  Motivation 

A  common  aim  of  statistical  analysis  is  to  judge  the  evidence  from  the  data 
in  support  of  a  particular  hypothesis,  defined  through  specific  parameter  values. 
Hypothesis  tests  have  historically  been  used  for  various  purposes,  including: 

•  Determining  whether  a  set  of  data  is  consistent  with  a  particular  hypothesis 

•  Making  a  decision  as  to  which  of  two  hypotheses  is  best  supported  by  the  data 

We assume there exists a test statistic $T = T(Y)$ with large values of $T$ suggesting departures from $H_0$. In Sects. 2.9.3-2.9.5, three specific recipes are described, namely, score, Wald, and likelihood ratio test statistics. We define the p-value, or significance level, as

$$p = p(Y) = \Pr[T(Y) > T(y) \mid H_0],$$

so that, intuitively, if this probability is “small,” the data are inconsistent with $H_0$. If $T(Y)$ is continuous, then under $H_0$, the p-value $p(Y)$ follows the distribution U(0, 1). Consequently, the significance level is the observed $p(y)$. The distribution of $T(Y)$ under $H_0$ may be known analytically or may be simulated to produce a Monte Carlo or bootstrap test.
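When the null distribution of $T(Y)$ is simulated, the p-value is estimated by the proportion of simulated statistics at least as large as the observed one. The following is a minimal sketch under stated assumptions: the data are counts, the null is a Poisson model with mean `theta0`, the statistic is $T = |\bar{Y} - \theta_0|$, and, because this $T$ is discrete, ties are counted with $\geq$ rather than the strict inequality in the definition above. The names `rpois` and `monte_carlo_p_value` are hypothetical.

```python
import math
import random


def rpois(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def monte_carlo_p_value(y, theta0, B=2000, seed=4):
    """Estimate Pr[T(Y) >= T(y) | H0] by simulating B datasets under H0."""
    rng = random.Random(seed)
    n = len(y)
    t_obs = abs(sum(y) / n - theta0)          # observed test statistic
    exceed = 0
    for _ in range(B):
        y_sim = [rpois(theta0, rng) for _ in range(n)]
        if abs(sum(y_sim) / n - theta0) >= t_obs:
            exceed += 1
    return exceed / B


p_far = monte_carlo_p_value([10, 10, 10, 10, 10], theta0=1.0)   # data far from H0
p_exact = monte_carlo_p_value([1, 1, 1, 1, 1], theta0=1.0)      # data centred on H0
print(p_far, p_exact)
```

The Monte Carlo error in the estimated p-value is controlled by `B`, in the same way that the number of bootstrap samples controls the second approximation in Sect. 2.7.1.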

The  nomenclature  associated  with  the  broad  topic  of  hypothesis  testing  is 
confusing,  but  we  distinguish  three  procedures: 

1. A pure significance test calculates $p$ but does not reject $H_0$ and is often viewed as an exploratory tool.

2. A test of significance sets a cutoff value $\alpha$ (e.g., $\alpha = 0.05$) and rejects $H_0$ if $p < \alpha$ corresponding to $T > T_\alpha$. The latter is known as the critical region.

3. A hypothesis test goes one step further and specifies an alternative hypothesis, $H_1$. One then reports whether $H_0$ is rejected or not. The null hypothesis has special position as the “status quo,” and conventionally the phrase “accept $H_0$” is not used because not rejecting may be due to low power (perhaps because of a small sample size) as opposed to $H_0$ being true.

Rejecting $H_0$ when it is true is known as a type I error, and a type II error occurs when $H_0$ is not rejected when it is in fact false. To evaluate the probability of a type II error, specific alternative values of the parameters need to be considered. The power is defined as the probability of rejecting $H_0$ when it is false. We emphasize that a test of significance may reject $H_0$ for general departures, while a hypothesis test rejects in the specific direction of $H_1$.

A key point is that the consistency of the data with $H_0$ is being evaluated, and there is no reference to the probability of the null hypothesis being true. As usual in frequentist inference, $H_0$ is a fixed unknown and probability statements cannot be assigned to it.$^4$


2.9.2  Preliminaries 

We consider a $p$-dimensional vector of parameters $\theta$ and consider two testing situations. In the first, we consider the simple null hypothesis $H_0: \theta = \theta_0$ versus the alternative $H_1: \theta \neq \theta_0$. In the second, we consider a partition of the parameter vector $\theta = [\theta_1, \theta_2]$, where the dimensions of $\theta_1$ and $\theta_2$ are $p - r$ and $r$, respectively, and a composite null. Specifically, in the composite case, we compare the hypotheses:

$$H_0: \theta_1 \text{ unrestricted},\ \theta_2 = \theta_{20}$$

$$H_1: \theta = [\theta_1, \theta_2] \neq [\theta_1, \theta_{20}].$$

As a simple example, in a regression context, let $\theta = [\theta_1, \theta_2]$ with $\theta_1$ the intercept and $\theta_2$ the slope. We may then be interested in $H_0: \theta_2 = 0$ with $\theta_1$ unspecified. In both the simple and composite situations, the unrestricted MLE under the alternative is denoted $\hat\theta_n = [\hat\theta_{n1}, \hat\theta_{n2}]$.

For simplicity of exposition, unless stated otherwise, we suppose that the responses $Y_i$, $i = 1, \dots, n$, are independent and identically distributed. Consequently we have $p(y \mid \theta) = \prod_{i=1}^n p(y_i \mid \theta)$. The extension to the nonidentically distributed situation, as required for regression, is straightforward. The $p \times 1$ score vector is

$$S_n(\theta) = \sum_{i=1}^n \frac{\partial l_i(\theta)}{\partial \theta},$$

where $l_i(\theta)$ is the log-likelihood contribution from observation $i$, $i = 1, \dots, n$. Let $S_n(\theta) = [S_{n1}(\theta), S_{n2}(\theta)]^T$ be a partition of the score vector with $S_{n1}(\theta)$ of dimension $(p - r) \times 1$ and $S_{n2}(\theta)$ of dimension $r \times 1$. Under the composite null, let $\hat\theta_n^{(0)} = [\hat\theta_{n10}, \theta_{20}]$ denote the MLE, where $\hat\theta_{n10}$ is found from the estimating equation

$$S_{n1}(\hat\theta_{n10}, \theta_{20}) = 0.$$

In general, $\hat\theta_{n10} \neq \hat\theta_{n1}$.

$^4$ As described in Chap. 3, in the Bayesian approach to hypothesis testing, a prior distribution is placed on the alternatives (and on the null), allowing the calculation of the probability of $H_0$ given the data, relative to other hypotheses under consideration.

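In the independent and identically distributed case, $I_n(\theta) = nI_1(\theta)$ is the information in a sample of size $n$. Suppressing the dependence on $\theta$, let

$$I_1 = \begin{bmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{bmatrix}$$

denote a partition of the expected information matrix, where $I_{11}$, $I_{12}$, $I_{21}$, and $I_{22}$ are of dimensions $(p - r) \times (p - r)$, $(p - r) \times r$, $r \times (p - r)$, and $r \times r$, respectively. The inverse of $I_1$ is

$$I_1^{-1} = \begin{bmatrix} I_{11\cdot 2}^{-1} & -I_{11\cdot 2}^{-1} I_{12} I_{22}^{-1} \\ -I_{22\cdot 1}^{-1} I_{21} I_{11}^{-1} & I_{22\cdot 1}^{-1} \end{bmatrix},$$

where

$$I_{11\cdot 2} = I_{11} - I_{12} I_{22}^{-1} I_{21}$$

$$I_{22\cdot 1} = I_{22} - I_{21} I_{11}^{-1} I_{12},$$

using results from Appendix B.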


2.9.3  Score  Tests 

We begin with the simple null $H_0: \theta = \theta_0$. Recall the asymptotic distribution of the score, given in (2.17):

$$n^{-1/2} S_n(\theta) \to_d N_p[0, I_1(\theta)].$$

Therefore, under the null hypothesis

$$S_n(\theta_0)^T I_1^{-1}(\theta_0) S_n(\theta_0)/n \to_d \chi^2_p. \qquad (2.52)$$

Intuitively, if the elements of $S_n(\theta_0)$ are large, this means that the components of the gradient at $\theta_0$ are large. The latter occurs when $\theta_0$ is “far” from the estimator $\hat\theta_n$ for which $S_n(\hat\theta_n) = 0$. In (2.52), the matrix $I_1(\theta_0)$ is scaling the gradient distance. The information may be evaluated at the MLE, $\hat\theta_n$, rather than at $\theta_0$, since $I_1(\hat\theta_n) \to_p I_1(\theta_0)$, by the weak law of large numbers.

Under the composite null hypothesis, $H_0: \theta_1$ unrestricted, $\theta_2 = \theta_{20}$,

$$S_n(\hat\theta_n^{(0)})^T I_1^{-1}(\hat\theta_n^{(0)}) S_n(\hat\theta_n^{(0)})/n \to_d \chi^2_r.$$

As a simplification, we can express this statistic in terms of partitioned information matrices. Since $p - r$ elements of the score vector are zero, that is, $S_{n1}(\hat\theta_n^{(0)}) = 0$, only the lower-right block $I_{22\cdot 1}^{-1}$ of $I_1^{-1}$ contributes, and we have

$$S_{n2}(\hat\theta_n^{(0)})^T I_{22\cdot 1}^{-1}(\hat\theta_n^{(0)}) S_{n2}(\hat\theta_n^{(0)})/n \to_d \chi^2_r.$$

Hence, the model only needs to be fitted under the null. Each of the score statistics remains asymptotically valid on replacement of the expected information by the observed information.


2.9.4  Wald  Tests 

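Under the simple null hypothesis $H_0: \theta = \theta_0$, the Wald statistic is based upon the asymptotic distribution

$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N_p[0, I_1(\theta_0)^{-1}], \qquad (2.53)$$

and the Wald statistic is the quadratic form based on (2.53):

$$\sqrt{n}(\hat\theta_n - \theta_0)^T I_1(\theta_0) \sqrt{n}(\hat\theta_n - \theta_0) \to_d \chi^2_p. \qquad (2.54)$$

An alternative form that is often used in practice is

$$\sqrt{n}(\hat\theta_n - \theta_0)^T I_1(\hat\theta_n) \sqrt{n}(\hat\theta_n - \theta_0) \to_d \chi^2_p,$$

which again follows because $I_1(\hat\theta_n) \to_p I_1(\theta_0)$, by the weak law of large numbers.

Under a composite null hypothesis, the Wald statistic is based on the marginal distribution of $\hat\theta_{n2}$, whose asymptotic variance is the lower-right block $I_{22\cdot 1}^{-1}$ of $I_1^{-1}$:

$$\sqrt{n}(\hat\theta_{n2} - \theta_{20})^T I_{22\cdot 1}(\hat\theta_n) \sqrt{n}(\hat\theta_{n2} - \theta_{20}) \to_d \chi^2_r.$$

The observed information may replace the expected information in either form of the Wald statistic.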


2.9.5  Likelihood  Ratio  Tests 

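Finally, we consider the likelihood ratio statistic which, under a simple null, is

$$2\left[l_n(\hat\theta_n) - l_n(\theta_0)\right].$$

Unlike the score and Wald statistics, the asymptotic distribution is not an obvious quadratic form, and so we provide a sketch proof of the asymptotic distribution under $H_0$. A second-order Taylor expansion of $l_n(\theta_0)$ about $\hat\theta_n$ gives

$$l_n(\theta_0) = l_n(\hat\theta_n) + (\theta_0 - \hat\theta_n)^T \left.\frac{\partial l_n}{\partial \theta}\right|_{\hat\theta_n} + \frac{1}{2} (\theta_0 - \hat\theta_n)^T \left.\frac{\partial^2 l_n}{\partial \theta\, \partial \theta^T}\right|_{\tilde\theta} (\theta_0 - \hat\theta_n),$$

where $\tilde\theta$ is between $\theta_0$ and $\hat\theta_n$. The middle term on the right-hand side is zero, and

$$\frac{1}{n} \frac{\partial^2 l_n(\tilde\theta)}{\partial \theta\, \partial \theta^T} \to_p -I_1(\theta_0).$$

Hence,

$$-2\left[l_n(\theta_0) - l_n(\hat\theta_n)\right] = 2\left[l_n(\hat\theta_n) - l_n(\theta_0)\right] \approx n(\hat\theta_n - \theta_0)^T I_1(\theta_0)(\hat\theta_n - \theta_0),$$

and so

$$2\left[l_n(\hat\theta_n) - l_n(\theta_0)\right] \to_d \chi^2_p.$$

Similarly, under a composite null hypothesis:

$$2\left[l_n(\hat\theta_n) - l_n(\hat\theta_n^{(0)})\right] \to_d \chi^2_r. \qquad (2.55)$$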


2.9.6  Quasi-likelihood 

We briefly consider the quasi-likelihood model described in Sect. 2.5. The score
test can be based on the quasi-score statistic $U_n(\beta) = D^T V^{-1}(Y - \mu)/\alpha$, with
the information in a sample of size $n$ being $D^T V^{-1} D/\alpha$. The latter is also used
in the calculation of a Wald statistic since it supplies the required standard errors.
Similarly, a quasi-likelihood ratio test can be performed using $l_n(\hat\beta_n, \alpha)$, the form
of which is given in (2.32). Unknown $\alpha$ can be accommodated by substitution of a
consistent estimator $\hat\alpha$. For example, we might estimate $\alpha$ via the Pearson statistic
estimator (2.31).

If one wished to account for estimation of $\alpha$, then one possibility is to assume
that $(n - p) \times \hat\alpha/\alpha$ follows a $\chi^2_{n-p}$ distribution and then evaluate significance based
on the ratio of scaled $\chi^2$ random variables, to give an $F$ distribution under
the null (see Appendix B). Outside of the normal linear model, this seems a dubious
exercise, however, since the numerator and denominator will not be independent,
and either of the $\chi^2$ approximations could be poor. The use of an $F$ statistic is
conservative, however (so that significance will be reduced over the use of the plug-in
$\chi^2$ approximation).


2.9.7 Comparison of Test Statistics

The  score  test  statistic  is  invariant  under  reparameterization,  provided  that  the 
expected,  rather  than  the  observed,  information  is  used.  The  score  statistic  may  also 
be evaluated without second derivatives if $S_n(\theta_0)S_n(\theta_0)^T$ is used, which may be
useful  if  these  derivatives  are  complex,  or  unavailable.  The  score  statistic  requires 
the  value  of  the  score  at  the  null,  but  the  MLE  under  the  alternative  is  not  required. 

Confidence  intervals  can  be  derived  directly  from  the  Wald  statistic  so  that  there 
is  a  direct  link  between  estimation  and  testing.  Interpretation  is  also  straightforward; 
in  particular,  statistical  versus  practical  significance  can  be  immediately  considered. 
A major drawback of the Wald statistic is that it is not invariant to the parameterization
chosen, which ties in with our earlier observation (Sect. 2.3) that asymptotic
confidence  intervals  are  more  accurate  on  some  scales  than  on  others.  The  Wald 
statistic  uses  the  MLE  but  not  the  value  of  the  maximized  likelihood. 

The  likelihood  ratio  statistic  is  invariant  under  reparameterization.  Confidence 
intervals derived from likelihood ratio tests always preserve the support of the
parameter, unlike score- and Wald-based intervals (unless a suitable parameterization
is adopted). Similar to the attainment of the Cramér-Rao lower bound (Appendix F),
there  is  an  elegant  theory  under  which  the  likelihood  ratio  test  statistic  emerges  as 
the  uniformly  most  powerful  (UMP)  test,  via  the  famous  Neyman-Pearson  lemma; 
see,  for  example,  Schervish  (1995).  The  likelihood  ratio  test  requires  the  fitting  of 
two  models. 

The  score,  Wald,  and  likelihood  ratio  test  statistics  are  asymptotically  equivalent 
but  are  not  equally  well  behaved  in  finite  samples.  In  general,  and  by  analogy 
with  the  asymptotic  optimality  of  the  MLE,  the  likelihood  ratio  statistic  is  often 
recommended for use in regular models. If $\hat\theta_n$ and $\theta_0$ are close, then the three
statistics  will  tend  to  agree. 

Chapter  4  provides  an  extended  discussion  and  critique  of  hypothesis  testing. 


Example:  Poisson  Mean 

We illustrate the use of the three statistics in a simple context. Suppose we have data
$Y_i \mid \lambda \sim_{iid} \mathrm{Poisson}(\lambda)$, $i = 1, \ldots, n$, and we are interested in $H_0: \lambda = \lambda_0$. The
log-likelihood, score, and information are

$$l_n(\lambda) = -n\lambda + n\bar{Y}\log\lambda,$$

$$S_n(\lambda) = -n + \frac{n\bar{Y}}{\lambda},$$

$$I_1(\lambda) = \frac{1}{\lambda}.$$

Fig. 2.4 Geometric interpretation of score, Wald, and likelihood ratio (LR) statistics, for Poisson
data and a test of $H_0: \lambda_0 = 1$, with data resulting in $\hat\lambda = \bar{y} = 0.6$

The score and Wald statistics follow from (2.52) and (2.54) and both lead to

$$\frac{n(\bar{Y} - \lambda_0)^2}{\lambda_0} \to_d \chi^2_1$$

under the null. From (2.55), the likelihood ratio statistic is

$$2n[\bar{Y}(\log\bar{Y} - \log\lambda_0) - (\bar{Y} - \lambda_0)] \to_d \chi^2_1.$$

Suppose we observe $\sum_{i=1}^n y_i = 12$ events in $n = 20$ trials so that $\hat\lambda = \bar{y} = 0.6$.
Assume we are interested in testing the null hypothesis $H_0: \lambda_0 = 1.0$. The score
and Wald statistics are 3.20 and the likelihood ratio statistic is 3.74, with associated
observed significance levels of 7.3% and 5.4%, respectively. Figure 2.4 plots the
log-likelihood against $\lambda$ for these data. The (unscaled) statistics are indicated on the
figure. The score test is based on the gradient at $\lambda_0$, the Wald statistic is the squared
horizontal distance between $\hat\lambda$ and $\lambda_0$, and the likelihood ratio test statistic is two
times the vertical distance between $l(\hat\lambda)$ and $l(\lambda_0)$.

We now reparameterize to $\theta = \log\lambda$, so that the null becomes $H_0: \theta = \theta_0 = 0$.
The likelihood ratio statistic is invariant to parameterization, and the score statistic
turns out to be the same as previously in this example, since the observed and
expected information are equal. The forms of the Wald, score, and likelihood ratio
statistics, for general $\theta_0$, are

$$n(\log\bar{Y} - \theta_0)^2\exp(\theta_0),$$

$$n[\bar{Y} - \exp(\theta_0)]^2\exp(-\theta_0),$$

$$2n\{\bar{Y}(\hat\theta - \theta_0) - [\exp(\hat\theta) - \exp(\theta_0)]\},$$

with numeric values of 5.22, 3.20 and 3.74, respectively, in the example.
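
The numeric values quoted in this example can be reproduced with a few lines of code. The following is a minimal sketch, assuming only the closed-form statistics given above; the function name `poisson_test_statistics` and its argument names are illustrative choices, not part of the text, and the sketch assumes at least one observed event so that the logarithm of the sample mean exists.

```python
import math


def poisson_test_statistics(total: int, n: int, lam0: float) -> dict:
    """Score, Wald, and likelihood ratio statistics for H0: lambda = lam0.

    Assumes iid Poisson counts with sum `total` (> 0) over `n` observations.
    """
    ybar = total / n                # MLE of lambda
    theta_hat = math.log(ybar)      # MLE of theta = log(lambda)
    theta0 = math.log(lam0)

    # lambda scale: with I_1(lam0) = 1/lam0 the score and Wald forms coincide.
    score_lambda = n * (ybar - lam0) ** 2 / lam0
    wald_lambda = score_lambda

    # theta = log(lambda) scale: I_1(theta0) = exp(theta0).
    wald_theta = n * (theta_hat - theta0) ** 2 * math.exp(theta0)
    score_theta = n * (ybar - math.exp(theta0)) ** 2 * math.exp(-theta0)

    # The likelihood ratio statistic is the same on either scale.
    lr = 2 * n * (ybar * (theta_hat - theta0)
                  - (math.exp(theta_hat) - math.exp(theta0)))

    return {
        "score_lambda": score_lambda,
        "wald_lambda": wald_lambda,
        "score_theta": score_theta,
        "wald_theta": wald_theta,
        "lr": lr,
    }


stats = poisson_test_statistics(total=12, n=20, lam0=1.0)
for name, value in stats.items():
    print(f"{name:>13s}: {value:.2f}")
```

Only the Wald statistic changes between the two parameterizations, which is the lack of invariance noted in Sect. 2.9.7.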


2.10  Concluding  Remarks 

In  Sect.  1.2,  we  emphasized  that  model  formulation  should  begin  with  the  model 
that  is  felt  most  appropriate  for  the  context,  before  proceeding  to  determine  the 
behavior  of  inferential  procedures  under  this  model.  In  this  chapter  we  have  seen 
that  likelihood-based  inference  is  asymptotically  efficient  if  the  model  is  correct. 
Hence,  if  one  has  strong  belief  in  the  assumed  model,  then  a  likelihood  approach  is 
appealing,  particularly  if  the  score  equations  are  of  linear  exponential  family  form, 
since  in  this  case  consistent  estimators  of  the  parameters  in  the  assumed  regression 
model  are  obtained.  If  the  likelihood  is  not  of  linear  exponential  form,  then  there 
are  no  guarantees  of  consistency  under  model  misspecification.  So  far  as  estimation 
of  the  standard  error  is  concerned,  in  situations  in  which  n  is  sufficiently  large  for 
asymptotic  inference  to  be  accurate,  sandwich  estimation  or  the  bootstrap  may  be 
used  to  provide  consistent  model-free  standard  errors,  so  long  as  the  observations  are 
uncorrelated.  The  relevance  of  asymptotic  calculations  for  particular  sample  sizes 
may  be  investigated  via  simulation.  In  general,  sandwich  estimation  is  a  very  simple, 
broadly  applicable  and  appealing  technique. 

In  many  instances  the  context  and/or  questions  of  interest  may  determine  the 
mean  function  and  perhaps  give  clues  to  the  mean-variance  relationship.  The  form 
of  the  data  may  suggest  viable  candidates  for  the  full  probability  model.  A  caveat  to 
this  is  that  models  such  as  the  Poisson  or  exponential  for  which  there  is  no  dispersion 
parameter  should  be  used  with  extreme  caution  since  there  is  no  mechanism  to  “soak 
up”  excess  variability.  In  practice,  if  the  data  exhibit  overdispersion,  as  is  often  the 
case,  then  this  will  lead  to  confidence  intervals  that  are  too  short.  Information  on 
the  mean  and  variance  may  be  used  within  a  quasi-likelihood  approach  to  define 
an  estimator,  and  if  n  is  sufficiently  large,  sandwich  estimation  can  provide  reliable 
standard  errors.  Experience  of  particular  models  may  help  to  determine  whether 
the  assumption  of  a  particular  likelihood  with  the  desired  mean  and  variance 
functions  is  likely  to  be  much  less  reliable  than  a  quasi-likelihood  approach.  The 
choice  of  how  parametric  one  wishes  to  be  will  often  come  down  to  personal  taste. 

We  finally  note  that  the  efficiency-robustness  trade-off  will  be  weighted  in 
different  directions  depending  on  the  nature  of  the  analysis.  In  an  exploratory 
setting,  one  may  be  happy  to  proceed  with  a  likelihood  analysis,  while  in  a 
confirmatory setting, one may want to be more conservative.


2.11 Bibliographic Notes

Numerous accounts of the theory behind frequentist inference are available; Cox and
Hinkley (1974) remains a classic text. Casella and Berger (1990) also provides an
in-depth discussion of frequentist estimation and hypothesis testing. A mathematically
rigorous treatment of the estimating functions approach is provided by van der Vaart
(1998). A gentler and very readable presentation of a reduced amount of material is
Ferguson (1996). Further discussion of estimating functions, particularly for
quasi-likelihood, may be found in Heyde (1997) and Crowder (1986).

Likelihood  was  introduced  by  Fisher  (1922,  1925b),  and  quasi-likelihood  by 
Wedderburn  (1974).  Asymptotic  details  for  quasi-likelihood  are  described  in 
McCullagh (1983), while Gauss-Markov theorems detailing optimality are
described in Godambe and Heyde (1987) and Heyde (1997). Firth (1993) provides
an  excellent  review  of  quasi-likelihood. 

Crowder (1987) gives counterexamples that reveal situations in which
quasi-likelihood is unreliable. Linear and quadratic estimating functions are described
by  Firth  (1987)  and  Crowder  (1987).  Firth  (1987)  also  investigates  the  efficiency 
of  quasi-likelihood  estimators  and  concludes  that  such  estimators  are  robust  to 
“moderate  departures”  from  the  likelihood  corresponding  to  the  score. 

The  form  of  the  sandwich  estimator  was  given  in  Huber  (1967).  White  (1980) 
implemented  the  technique  for  the  linear  model,  and  Royall  (1986)  provides  a  clear 
and simple account with many examples. Carroll et al. (1995, Appendix A.3) gives
a  very  readable  review  of  sandwich  estimation. 

Efron  (1979)  introduced  the  bootstrap,  and  subsequently  there  has  been  a  huge 
literature  on  its  theoretical  properties  and  practical  use.  Bickel  and  Freedman  (1981) 
and  Singh  (1981)  provide  early  theoretical  discussions;  see  also  van  der  Vaart 
(1998).  Book-length  treatments  include  Efron  and  Tibshirani  (1993)  and  Davison 
and  Hinkley  (1997). 

The  score  test  was  introduced  in  Rao  (1948)  as  an  alternative  to  the  likelihood 
ratio  and  Wald  tests  introduced  in  Neyman  and  Pearson  (1928)  and  Wald  (1943), 
respectively.  Consequently,  the  score  test  is  sometimes  known  as  the  Rao  score  test. 
Cox  and  Hinkley  (1974)  provide  a  general  discussion  of  hypothesis  testing.  Peers 
(1971)  compares  the  power  of  score,  Wald,  and  likelihood  ratio  tests.  An  excellent 
expository  article  on  the  three  statistics,  emphasizing  a  geometric  perspective,  may 
be  found  in  Buse  (1982). 


2.12  Exercises 


2.1 Suppose $Y_1, Y_2 \mid \theta \sim_{iid} U(\theta - 0.5, \theta + 0.5)$. Show that $\Pr(\min\{Y_1, Y_2\} <
\theta < \max\{Y_1, Y_2\} \mid \theta) = 0.5$, so that $[\min\{Y_1, Y_2\}, \max\{Y_1, Y_2\}]$ is a
50% confidence interval for $\theta$. Suppose we observe a particular interval with
$\max\{Y_1, Y_2\} - \min\{Y_1, Y_2\} > 0.5$. Show that in this case we know with
probability 1 that this interval contains $\theta$.$^5$

2.2 Consider a single observation from a Poisson distribution: $Y \mid \theta \sim \mathrm{Poisson}(\theta)$.

(a) Suppose we wish to estimate $\exp(-3\theta)$. Show that the UMVUE is $(-2)^y$
for $y = 0, 1, 2, \ldots$ Is this a reasonable estimator?

(b) Suppose we wish to estimate $\theta^2$. Show that $T(T - 1)/n^2$ is the UMVUE
for $\theta^2$. By examining the case $T = 1$ comment on whether this is a sensible
estimator.

2.3 Let $Y_i \mid \sigma^2 \sim_{iid} N(\mu, \sigma^2)$ with $\mu$ known.

(a) Show that the distribution $p(y \mid \sigma^2)$ is a one-parameter exponential family
member.

(b) Show that $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \mu)^2$ is an unbiased estimator of $\sigma^2$ and
evaluate its variance.

(c) Consider estimators of the form $\tilde\sigma_a^2 = a\sum_{i=1}^n (Y_i - \mu)^2$. Determine the
value of $a$ that minimizes the mean squared error.

(d) The use of mean squared error to judge an estimator is appropriate for a
quadratic loss function, in this case $L(\tilde\sigma_a^2, \sigma^2) = (\tilde\sigma_a^2 - \sigma^2)^2$. Since $\sigma^2 > 0$,
there is an asymmetry in this loss function. Hence, explain why downward
bias in an estimator of $\sigma^2$ can be advantageous.

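(e) Show that $\hat\sigma^2$ is optimal amongst estimators $\tilde\sigma_a^2$ with respect to the Stein
loss function

$$L_S(\tilde\sigma_a^2, \sigma^2) = \frac{\tilde\sigma_a^2}{\sigma^2} - \log\left(\frac{\tilde\sigma_a^2}{\sigma^2}\right) - 1.$$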


2.4 Suppose $Y_i \mid \theta_i \sim_{ind} \mathrm{Poisson}(\theta_i)$ with $\theta_i \sim_{ind} \mathrm{Ga}(\mu_i b, b)$ for $i = 1, \ldots, n$.

(a) Show that $E[Y_i] = \mu_i$ and $\mathrm{var}(Y_i) = \mu_i(1 + b^{-1})$.

(b) Show that the marginal distribution of $Y_i \mid \mu_i, b$ is negative binomial.

(c) Suppose $\log\mu_i = \beta_0 + \beta_1 x_i$. Write down the likelihood function $L(\beta, b)$,
log-likelihood function $l(\beta, b)$, score function $S(\beta, b)$, and expected
information matrix $I(\beta, b)$.

2.5  Consider  the  exponential  regression  problem  with  independent  responses 


$$p(y_i \mid \lambda_i) = \lambda_i e^{-\lambda_i y_i}, \quad y_i > 0,$$


and $\log\lambda_i = -\beta_0 - \beta_1 x_i$ for given covariates $x_i$, $i = 1, \ldots, n$. We wish to
estimate the $2 \times 1$ regression parameter $\beta = [\beta_0, \beta_1]^T$ using MLE.


5  This  exercise  shows  that  although  the  confidence  interval  has  the  correct  frequentist  coverage 
when  averaging  over  all  possible  realizations  of  data,  for  some  data  we  know  with  probability  1 
that  the  specific  interval  created  contains  the  parameter.  The  probability  distribution  of  the  data  in 
this  example  is  not  regular  (since  the  support  of  the  data  depends  on  the  unknown  parameter),  and 
so  we  might  anticipate  difficulties.  Conditioning  on  an  ancillary  statistic  resolves  the  problems;  see 
Davison  (2003,  Example  12.3). 



Table 2.5 Survival times $y_i$ and concentrations of a contaminant $x_i$ for $i = 1, \ldots, 15$

i      1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
x_i    6.1   4.2   0.5   8.8   1.5   9.2   8.5   8.7   6.7   6.5   6.3   6.7   0.2   8.7   7.5
y_i    0.8   3.5   12.4  1.1   8.9   2.4   0.1   0.4   3.5   8.3   2.6   1.5   16.6  0.1   1.3

(a) Find expressions for the likelihood function $L(\beta)$, log-likelihood function
$l(\beta)$, score function $S(\beta)$, and Fisher’s information matrix $I(\beta)$.

(b) Find expressions for the maximum likelihood estimate $\hat\beta$. If no closed-form
solution exists, then instead provide a functional form that could be simply
implemented.

(c) For the data in Table 2.5, numerically maximize the likelihood function to
obtain estimates of $\beta$. These data consist of the survival times ($y$) of rats
as a function of concentrations of a contaminant ($x$). Find the asymptotic
covariance matrix for your estimate using the information $I(\beta)$. Provide a
95% confidence interval for each of $\beta_0$ and $\beta_1$.

(d) Plot the log-likelihood function $l(\beta_0, \beta_1)$ and compare with the log of the
asymptotic normal approximation to the sampling distribution of the MLE.

(e) Find the maximum likelihood estimate $\hat\beta_0$ under the null hypothesis
$H_0: \beta_1 = 0$.

(f) Perform score, likelihood ratio, and Wald tests of the null hypothesis
$H_0: \beta_1 = 0$ with $\alpha = 0.05$. In each case, explicitly state the formula you use to
compute the test statistic.

(g) Summarize the results of the estimation and hypothesis testing carried out
above. In particular, address the question of whether increasing concentrations
of the contaminant are associated with a rat’s life expectancy.

2.6 Consider the so-called Neyman-Scott problem (Neyman and Scott 1948) in
which $Y_{ij} \mid \mu_i, \sigma^2 \sim_{ind} N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$, $j = 1, 2$. Obtain the MLE
of $\sigma^2$ and show that it is inconsistent. Why does the inconsistency arise in this
example?

2.7  Consider  the  example  discussed  at  the  end  of  Sect.  2.4.3  in  which  the  true 
distribution  is  gamma,  but  the  assumed  likelihood  is  exponential. 

(a)  Evaluate  the  form  of  the  sandwich  estimator  of  the  variance,  and  compare 
with  the  form  of  the  model-based  estimator. 

(b)  Simulate  data  from  Ga(4,2)  and  Ga(10,2)  distributions,  for  n  =  10  and 
n  =  30,  and  obtain  the  MLEs  and  sandwich  and  model-based  variance 
estimates.  Compare  these  variances  with  the  empirical  variances  observed 
in  the  simulations. 

(c)  Provide  figures  showing  the  log  of  the  gamma  densities  of  the  previous  part, 
plotted  against  y,  along  with  the  “closest”  exponential  densities. 

2.8 Consider the Poisson-gamma random effects model given by (2.33) and (2.34),
which leads to a negative binomial marginal model with the variance a quadratic
function of the mean. Design a simulation study, along the lines of that which
produced Table 2.3, to investigate the efficiency and robustness under the
Poisson model, quasi-likelihood (with variance proportional to the mean), the
negative binomial model, and sandwich estimation. Use a loglinear model

$$\log\mu_i = \beta_0 + \beta_1 x_i,$$

with $x_i \sim_{iid} N(0, 1)$, for $i = 1, \ldots, n$, and $\beta_0 = 2$, $\beta_1 = \log 2$. You should
repeat the simulation for different values of both $n$ and the negative binomial
overdispersion parameter $b$. Report the 95% confidence interval coverages for
$\beta_0$ and $\beta_1$, for each model.

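2.9 A pivotal bootstrap interval is evaluated as follows. Let $R_n = \hat\theta_n - \theta$ be a pivot,
and $H(r) = \Pr_F(R_n \le r)$ be the distribution function of the pivot. Now define
an interval $C_n = [a_n, b_n]$ where

$$a_n = \hat\theta_n - H^{-1}\left(1 - \frac{\alpha}{2}\right),$$

$$b_n = \hat\theta_n - H^{-1}\left(\frac{\alpha}{2}\right).$$

(a) Show that

$$\Pr(a_n \le \theta \le b_n) = 1 - \alpha$$

so that $C_n$ is an exact $100(1 - \alpha)$% confidence interval for $\theta$.

(b) Hence, show that the confidence interval is $C_n = [a_n, b_n]$ where

$$a_n = \hat\theta_n - \hat{H}^{-1}\left(1 - \frac{\alpha}{2}\right) = \hat\theta_n - r^*_{1-\alpha/2} = 2\hat\theta_n - \theta^*_{1-\alpha/2},$$

$$b_n = \hat\theta_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2},$$

where $r^*_\gamma$ denotes the $\gamma$ sample quantile of the $B$ bootstrap samples
$[R^*_{n1}, \ldots, R^*_{nB}]$ and $\theta^*_\gamma$ the $\gamma$ sample quantile of $[\hat\theta^*_{n1}, \ldots, \hat\theta^*_{nB}]$.

[Hint: To evaluate $a_n$ and $b_n$, we need to know $H$, which is unknown, but
may be estimated based on the bootstrap estimates

$$\hat{H}(r) = \frac{1}{B}\sum_{b=1}^{B} I(R^*_{nb} \le r)$$

where $R^*_{nb} = \hat\theta^*_{nb} - \hat\theta_n$, $b = 1, \ldots, B$.]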


Chapter  3 

Bayesian  Inference 


3.1  Introduction 

In  the  Bayesian  approach  to  inference,  all  unknown  quantities  contained  in  a 
probability  model  for  the  observed  data  are  treated  as  random  variables.  This  is  in 
contrast  to  the  frequentist  view  described  in  Chap.  2  in  which  parameters  are  treated 
as  fixed  constants.  Specifically,  with  respect  to  the  inferential  targets  of  Sect.  2.1, 
the  fixed  but  unknown  parameters  and  hypotheses  are  viewed  as  random  variables 
under  the  Bayesian  approach.  Additionally,  the  unknowns  may  include  missing  data, 
or  the  true  covariate  value  in  an  errors-in-variables  setting. 

The  structure  of  this  chapter  is  as  follows.  In  Sect.  3.2  we  describe  the  constituents 
of  the  posterior  distribution  and  its  summarization  and  in  Sect.  3.3  consider  the 
asymptotic properties of Bayesian estimators. Section 3.4 examines prior
specification, and in Sect. 3.5 issues relating to model misspecification are discussed.
Section  3.6  describes  one  approach  to  accounting  for  model  uncertainty  via 
Bayesian  model  averaging.  As  we  see  in  Sect.  3.2,  to  implement  the  Bayesian 
approach,  integration  over  the  parameter  space  is  required,  and  historically  this  has 
proved  a  significant  hurdle  to  the  routine  use  of  Bayesian  methods.  Consequently, 
we discuss implementation issues in some detail. In Sect. 3.7, we provide a
description of so-called conjugate situations in which the required integrals are analytically
tractable,  before  providing  an  overview  of  analytical  and  numerical  integration 
techniques,  importance  sampling,  and  direct  sampling  from  the  posterior.  One 
particular  technique,  Markov  chain  Monte  Carlo  (MCMC),  has  greatly  extended 
the  range  of  models  that  may  be  analyzed  with  Bayesian  methods,  and  Sect.  3.8 
is  devoted  to  a  description  of  MCMC.  Section  3.9  considers  the  important  topic 
of  exchangeability,  and  in  Sect.  3.10  hypothesis  testing  via  so-called  Bayes  factors 
is  discussed.  Section  3.11  considers  a  hybrid  approach  to  inference  in  which  the 
likelihood  is  taken  as  the  sampling  distribution  of  an  estimator  and  is  combined  with 
a prior via Bayes theorem. Concluding remarks appear in Sect. 3.12, including
a comparison of frequentist and Bayesian approaches, and the chapter ends with
bibliographic notes in Sect. 3.13.


3.2 The Posterior Distribution and Its Summarization


Let $\theta = [\theta_1, \ldots, \theta_p]^T$ denote all of the unknowns of the model, which we continue
to refer to as parameters, and $y = [y_1, \ldots, y_n]^T$ the vector of observed data. Also
let $\mathcal{I}$ represent all relevant information that is currently available to the individual
who is carrying out the analysis, in addition to $y$. In the following description, we
assume for simplicity that each element of $\theta$ is continuous.

Bayesian inference is based on the posterior probability distribution of $\theta$ after
observing $y$, which is given by Bayes theorem:

$$p(\theta \mid y, \mathcal{I}) = \frac{p(y \mid \theta, \mathcal{I})\,\pi(\theta \mid \mathcal{I})}{p(y \mid \mathcal{I})}. \tag{3.1}$$

There are two key ingredients: the likelihood function $p(y \mid \theta, \mathcal{I})$ and the prior
distribution $\pi(\theta \mid \mathcal{I})$. The latter represents the probability beliefs for $\theta$ held
before observing the data $y$. Both are dependent upon the current information $\mathcal{I}$.
Different individuals will have different information $\mathcal{I}$, and so in general their prior
distributions (and possibly their likelihood functions) may differ. The denominator
in (3.1), $p(y \mid \mathcal{I})$, is a normalizing constant which ensures that the right-hand
side integrates to one over the parameter space. Though of crucial importance, for
notational convenience, from this point onwards we suppress the dependence on $\mathcal{I}$,
to give


$$p(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{p(y)},$$

where the normalizing constant is

$$p(y) = \int_\theta p(y \mid \theta)\,\pi(\theta)\, d\theta, \tag{3.2}$$

and is the marginal probability of the observed data given the model, that is, the
likelihood and the prior. Ignoring this constant gives

$$p(\theta \mid y) \propto p(y \mid \theta) \times \pi(\theta)$$

or, more colloquially,

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}.$$

The  use  of  the  posterior  distribution  for  inference  is  very  intuitively  appealing  since 
it  probabilistically  combines  the  information  on  the  parameters  contained  in  the  data 
and  in  the  prior. 

The manner by which inference is updated from prior to posterior extends
naturally to the sequential arrival of data. Suppose first that $y_1$ and $y_2$ represent
the current totality of data. Then the posterior is

$$p(\theta \mid y_1, y_2) = \frac{p(y_1, y_2 \mid \theta)\,\pi(\theta)}{p(y_1, y_2)}. \tag{3.3}$$

Now consider a previous occasion at which only $y_1$ was available. The posterior
based on these data only is

$$p(\theta \mid y_1) = \frac{p(y_1 \mid \theta)\,\pi(\theta)}{p(y_1)}.$$

After observing $y_1$ and before observing $y_2$, the “prior” for $\theta$ corresponds to
the posterior $p(\theta \mid y_1)$, since this distribution represents the current beliefs
concerning $\theta$. We then update via

$$p(\theta \mid y_1, y_2) = \frac{p(y_2 \mid y_1, \theta)\,\pi(\theta \mid y_1)}{p(y_2 \mid y_1)}. \tag{3.4}$$


Factorizing the right-hand side of (3.3) gives

$$p(\theta \mid y_1, y_2) = \frac{p(y_2 \mid y_1, \theta)}{p(y_2 \mid y_1)} \times \frac{p(y_1 \mid \theta)\,\pi(\theta)}{p(y_1)},$$

which equals the right-hand side of (3.4). Hence, consistent inference based on $y_1$
and $y_2$ is reached regardless of whether we produce the posterior in one or two
stages. In the case of conditionally independent observations,

$$p(y_1, y_2 \mid \theta) = p(y_1 \mid \theta)\,p(y_2 \mid \theta)$$

in (3.3) and

$$p(y_2 \mid y_1, \theta) = p(y_2 \mid \theta)$$

in (3.4).
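
The equivalence of one-stage and two-stage updating can be checked numerically. The sketch below is an illustration under assumptions that are not in the text: conditionally independent Bernoulli observations, a triangular prior on a grid of 200 values of the success probability, and two small made-up batches of data. It normalizes likelihood times prior on the grid, once using all the data and once in two steps.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def bernoulli_likelihood(theta, ys):
    """Likelihood of conditionally independent 0/1 observations."""
    successes = sum(ys)
    return theta ** successes * (1 - theta) ** (len(ys) - successes)


def update(prior, grid, ys):
    """Posterior on the grid: proportional to likelihood times prior."""
    return normalize([p * bernoulli_likelihood(t, ys) for p, t in zip(prior, grid)])


# Grid of midpoints on (0, 1) and an arbitrary (non-conjugate) triangular prior.
grid = [(k + 0.5) / 200 for k in range(200)]
prior = normalize([1 - abs(2 * t - 1) for t in grid])

y1 = [1, 0, 1, 1]   # illustrative first batch
y2 = [0, 0, 1]      # illustrative second batch

one_stage = update(prior, grid, y1 + y2)                  # as in (3.3)
two_stage = update(update(prior, grid, y1), grid, y2)     # as in (3.4)

max_diff = max(abs(a - b) for a, b in zip(one_stage, two_stage))
print(f"largest difference between the two posteriors: {max_diff:.2e}")
```

Any difference between the two posteriors is due only to floating-point rounding.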

At  first  sight,  the  Bayesian  approach  to  inference  is  deceptively  straightforward, 
but  there  are  a  number  of  important  issues  that  must  be  considered  in  practice. 
The first, clearly vital, issue is prior specification. Second, once prior and
likelihood ingredients have been decided upon, we need to summarize the (usually)
multivariate  posterior  distribution,  and  as  we  will  see,  this  summarization  requires 
integration  over  the  parameter  space,  which  may  be  of  high  dimension.  Finally,  a 
Bayesian  analysis  must  address  the  effect  that  possible  model  misspecification  has 
on  inference.  Prior  specification  is  taken  up  in  Sect.  3.4  and  model  misspecification 
in  Sect.  3.5.  Next,  posterior  summarization  is  described. 

Typically the posterior distribution $p(\theta \mid y)$ is multivariate, and marginal
distributions for parameters of interest will be needed. The univariate marginal
distribution for $\theta_i$ is

$$p(\theta_i \mid y) = \int_{\theta_{-i}} p(\theta \mid y)\, d\theta_{-i}, \tag{3.5}$$

where $\theta_{-i}$ is the vector $\theta$ excluding $\theta_i$, that is, $\theta_{-i} = [\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_p]$.
While examining the complete distribution will often be informative, reporting
summaries of this distribution is also useful. To this end moments and quantiles
may be calculated. For example, the posterior mean is

$$E[\theta_i \mid y] = \int_{\theta_i} \theta_i\, p(\theta_i \mid y)\, d\theta_i. \tag{3.6}$$

The $100 \times q$% quantile, $\theta_i(q)$, with $0 < q < 1$ is found by solving

$$q = \Pr[\theta_i \le \theta_i(q)] = \int_{-\infty}^{\theta_i(q)} p(\theta_i \mid y)\, d\theta_i. \tag{3.7}$$

The posterior median $\theta_i(0.5)$ is often an adequate summary of the location of the
posterior marginal distribution.
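
For a univariate parameter, the integrals in (3.2), (3.6), and (3.7) can be approximated on a grid. The following is a rough sketch rather than a recommended general method. It assumes, purely for illustration, Poisson data with 12 events in 20 observations and a Ga(1, 1) prior on the mean, so that the exact posterior is Ga(13, 21) and the grid answer can be compared with the known posterior mean 13/21. All function names are hypothetical.

```python
import math


def grid_posterior(log_unnormalized, lower, upper, cells=5000):
    """Midpoint-rule grid approximation; returns midpoints and cell probabilities."""
    width = (upper - lower) / cells
    grid = [lower + (k + 0.5) * width for k in range(cells)]
    logs = [log_unnormalized(t) for t in grid]
    top = max(logs)                               # subtract the maximum for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)                          # plays the role of p(y) in (3.2)
    return grid, [w / total for w in weights]


def posterior_mean(grid, prob):
    """Grid version of (3.6)."""
    return sum(t * p for t, p in zip(grid, prob))


def posterior_quantile(grid, prob, q):
    """Grid version of (3.7): first grid point with cumulative probability >= q."""
    cumulative = 0.0
    for t, p in zip(grid, prob):
        cumulative += p
        if cumulative >= q:
            return t
    return grid[-1]


# Assumed example: sum(y) = 12, n = 20, Ga(a, b) prior with a = b = 1.
a, b, total_events, n = 1.0, 1.0, 12, 20


def log_post(lam):
    # log prior + log likelihood, dropping terms that do not involve lam
    return (a + total_events - 1) * math.log(lam) - (b + n) * lam


grid, prob = grid_posterior(log_post, 0.0, 5.0)
mean = posterior_mean(grid, prob)
median = posterior_quantile(grid, prob, 0.5)
print(f"posterior mean   {mean:.4f} (exact {13 / 21:.4f})")
print(f"posterior median {median:.4f}")
```

Grid methods of this kind become impractical as the dimension of the parameter grows, which is one reason the implementation techniques of Sects. 3.7 and 3.8 are needed.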

Formally, the choice between posterior means and medians can be made by
viewing point estimation as a decision problem. For simplicity suppose that $\theta$ is
univariate and the action, $a$, is to choose a point estimate for $\theta$. Let $L(\theta, a)$ denote
the loss associated with choosing action $a$ when $\theta$ is the true state of nature. The
(posterior) expected loss of an action $a$ is

$$L(a) = \int_\theta L(\theta, a)\, p(\theta \mid y)\, d\theta \tag{3.8}$$

and the optimal choice is the action that minimizes the expected loss. Different loss
functions lead to different estimates (Exercise 3.1). For example, minimizing (3.8)
with the quadratic loss $L(\theta, a) = (\theta - a)^2$ leads to reporting the posterior mean,
$\hat{a} = E[\theta \mid y]$. The linear loss,

$$L(\theta, a) = \begin{cases} c_1(a - \theta) & \theta \le a \\ c_2(\theta - a) & \theta > a, \end{cases}$$

corresponds to a loss which is proportional to $c_1$ if we overestimate and to $c_2$ if we
underestimate. This function leads to $\hat{a}$ such that

$$\Pr(\theta \le \hat{a} \mid y) = \frac{c_2}{c_1 + c_2} = \frac{c_2/c_1}{1 + c_2/c_1},$$

that is, $\hat{a} = \theta\left(\frac{c_2}{c_1 + c_2}\right)$, so that presenting a quantile is the optimal action. Notice
that only the ratio of losses is required. When $c_1 = c_2$, under- and overestimation
are deemed equally hazardous, and the median of the posterior should be reported.

A $100 \times p$% equi-tailed credible interval ($0 < p < 1$) is provided by

$$[\theta_i(\{1 - p\}/2),\ \theta_i(\{1 + p\}/2)].$$

This interval is the one that is usually reported in the majority of Bayesian analyses
carried out, since it is the easiest to calculate. However, in cases where the posterior
is skewed, one may wish to instead calculate a highest posterior density (HPD)
interval in which points inside the interval have higher posterior density than those
outside the interval. Such an interval is also the shortest credible interval.
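
The link between the linear loss and posterior quantiles can be checked by simulation. The sketch below makes several assumptions that are not in the text: the posterior is taken to be Ga(13, 21), it is represented by 5,000 Monte Carlo draws, the losses are set to $c_1 = 1$ and $c_2 = 3$, and the minimization is a crude search over a grid of candidate actions. It also reports an equi-tailed 95% interval from the same draws.

```python
import random


def expected_linear_loss(a, draws, c1, c2):
    """Monte Carlo estimate of the posterior expected linear loss of action a."""
    total = 0.0
    for theta in draws:
        total += c1 * (a - theta) if theta <= a else c2 * (theta - a)
    return total / len(draws)


def sample_quantile(draws, q):
    """Simple empirical quantile (no interpolation)."""
    ordered = sorted(draws)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


rng = random.Random(1)
# Assumed posterior Ga(13, 21) in (shape, rate) form; gammavariate expects a scale.
draws = [rng.gammavariate(13.0, 1.0 / 21.0) for _ in range(5000)]

c1, c2 = 1.0, 3.0                                    # underestimation costs three times more
candidates = [0.2 + 0.005 * k for k in range(201)]   # candidate actions on [0.2, 1.2]
best_action = min(candidates, key=lambda a: expected_linear_loss(a, draws, c1, c2))
target_quantile = sample_quantile(draws, c2 / (c1 + c2))

p = 0.95
interval = (sample_quantile(draws, (1 - p) / 2), sample_quantile(draws, (1 + p) / 2))

print(f"grid minimizer of expected loss: {best_action:.3f}")
print(f"sample {c2 / (c1 + c2):.2f} quantile:            {target_quantile:.3f}")
print(f"equi-tailed 95% interval:        ({interval[0]:.3f}, {interval[1]:.3f})")
```

Up to the resolution of the candidate grid, the minimizer agrees with the sample quantile at level $c_2/(c_1 + c_2)$.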

Another useful inferential quantity is the predictive distribution for unobserved
(e.g., future) observations $z$. Under conditional independence, so that $p(z \mid \theta, y) =
p(z \mid \theta)$, this distribution is

$$p(z \mid y) = \int_\theta p(z \mid \theta)\, p(\theta \mid y)\, d\theta. \tag{3.9}$$

This derivation clearly assumes that the likelihood for the original data $y$ is also
appropriate for the unobserved observations $z$.
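
As a concrete check of (3.9), the following sketch evaluates the predictive distribution in a case where the integral is available in closed form. It assumes, for illustration only, a Ga(13, 21) posterior for a Poisson mean and a single future Poisson count $z$; in that case the predictive distribution is negative binomial, and the closed form is compared with a midpoint-rule approximation to the integral. The function names are hypothetical.

```python
import math

A, B = 13.0, 21.0   # assumed posterior Ga(A, B) in (shape, rate) form


def log_gamma_density(lam):
    """Log density of Ga(A, B) at lam."""
    return A * math.log(B) - math.lgamma(A) + (A - 1) * math.log(lam) - B * lam


def log_poisson_pmf(z, lam):
    """Log probability of count z under Poisson(lam)."""
    return -lam + z * math.log(lam) - math.lgamma(z + 1)


def predictive_exact(z):
    """Closed-form negative binomial predictive probability of z."""
    log_p = (math.lgamma(A + z) - math.lgamma(A) - math.lgamma(z + 1)
             + A * math.log(B / (B + 1)) - z * math.log(B + 1))
    return math.exp(log_p)


def predictive_numeric(z, upper=6.0, cells=6000):
    """Midpoint-rule approximation to the integral in (3.9)."""
    width = upper / cells
    total = 0.0
    for k in range(cells):
        lam = (k + 0.5) * width
        total += math.exp(log_poisson_pmf(z, lam) + log_gamma_density(lam))
    return total * width


for z in range(4):
    print(f"z = {z}: exact {predictive_exact(z):.5f}, numeric {predictive_numeric(z):.5f}")
```

The predictive probabilities are more dispersed than those of a Poisson distribution with the posterior mean plugged in, because they average over the remaining uncertainty in the parameter.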

The  Bayesian  approach  therefore  provides  very  natural  inferential  summaries. 
However,  these  summaries  require  the  evaluation  of  integrals,  and  for  most  models, 
these integrals are analytically intractable. Methods for implementation are
considered in Sects. 3.7 and 3.8.


3.3  Asymptotic  Properties  of  Bayesian  Estimators 

Although  Bayesian  purists  would  not  be  concerned  with  the  frequentist  properties 
of  Bayesian  procedures,  personally  I  find  it  reassuring  if,  for  a  particular  model,  a 
Bayesian  estimator  can  be  shown  to  be,  as  a  minimum,  consistent.  Efficiency  is  also 
an  interesting  concept  to  examine. 

We  informally  give  a  number  of  results,  before  referencing  more  rigorous 
treatments.  We  only  consider  parameter  vectors  of  finite  dimension.  An  important 
condition  that  we  assume  in  the  following  is  that  the  prior  distribution  is  positive  in 
a  neighborhood  of  the  true  value  of  the  parameter. 

The  famous  Bernstein-von  Mises  theorem  states  that,  with  increasing  sample 
size,  the  posterior  distribution  tends  to  a  normal  distribution  whose  mean  is  the 
MLE  and  whose  variance-covariance  matrix  is  the  inverse  of  Fisher’s  information. 
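Let $\theta$ be the true value of a $p$-dimensional parameter, and suppose we are in the situation in which the data are independent and identically distributed. Denote the posterior mean by $\tilde\theta_n = \tilde\theta_n(Y_n) = E[\theta \mid Y_n]$ and the MLE by $\hat\theta_n$. Then,

$$\sqrt{n}(\tilde\theta_n - \theta) = \sqrt{n}(\tilde\theta_n - \hat\theta_n) + \sqrt{n}(\hat\theta_n - \theta)$$

and we know that $\sqrt{n}(\hat\theta_n - \theta) \to_d N_p[0, I(\theta)^{-1}]$, where $I(\theta)$ is the information in a sample of size 1 (Sect. 2.4.1). It can be shown that $\sqrt{n}(\tilde\theta_n - \hat\theta_n) \to_p 0$ and so

$$\sqrt{n}(\tilde\theta_n - \theta) \to_d N_p[0, I(\theta)^{-1}].$$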

Hence, $\tilde\theta_n$ is $\sqrt{n}$-consistent and asymptotically efficient. It is important to emphasize
that the effect of the prior diminishes as $n \to \infty$. As van der Vaart (1998,
p.  140)  dryly  notes,  “Apparently,  for  an  increasing  number  of  observations  one’s 
prior  beliefs  are  erased  (or  corrected)  by  the  observations.” 
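
The vanishing gap between the posterior mean and the MLE can be seen numerically in a conjugate case. The sketch below assumes a binomial likelihood with a hypothetical Be(2, 2) prior (positive around the true value) and idealized data with $y/n$ fixed at 0.3; it illustrates the $\sqrt{n}$-scaled difference shrinking, and is not a proof of the theorem.

```python
import math

a, b = 2.0, 2.0   # hypothetical Beta(a, b) prior
p_true = 0.3

gaps = []
for n in (10, 100, 1000, 10000, 100000):
    y = p_true * n                     # idealized data with y / n = p_true
    mle = y / n
    post_mean = (a + y) / (a + b + n)  # mean of the Beta(a + y, b + n - y) posterior
    gap = math.sqrt(n) * abs(post_mean - mle)
    gaps.append(gap)
    print(n, round(post_mean, 5), round(gap, 5))
```

Here the scaled gap equals $\sqrt{n}\,|a - p(a+b)|/(a+b+n)$, which decays at rate $n^{-1/2}$.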
The Bernstein-von Mises theorem is so-called because of the papers by Bernstein
(1917)  and  von  Mises  (1931),  though  the  theorem  has  been  refined  by  a  number  of 
authors.  For  references  and  a  recent  treatment,  see  van  der  Vaart  (1998,  Sect.  10.2). 
An  early  paper  on  consistency  of  Bayesian  estimators  is  Doob  (1948)  and  again 
there  have  been  many  refinements;  see  van  der  Vaart  (1998,  Sect.  10.4).  An 
important assumption is that the parameter space is finite-dimensional. Diaconis and Freedman
(1986)  describe  the  problems  that  can  arise  in  the  infinite-dimensional  case. 


3.4  Prior  Choice 

The  specification  of  the  prior  distribution  is  clearly  a  necessary  and  crucial  aspect  of 
the  Bayesian  approach.  With  respect  to  prior  choice,  an  important  first  observation 
is that for all $\theta$ for which $\pi(\theta) = 0$, we necessarily have $p(\theta \mid y) = 0$, regardless of
any  realization  of  the  observed  data,  which  clearly  illustrates  that  great  care  should 
be  taken  in  excluding  parts  of  the  parameter  space  a  priori. 

We  distinguish  between  two  types  of  prior  specification.  In  the  first,  which  we 
label  as  baseline  prior  specification,  we  presume  an  analysis  is  required  in  which 
the  prior  distribution  has  “minimal  impact,”  so  that  the  information  in  the  likelihood 
dominates  the  posterior.  An  alternative  label  for  such  an  analysis  is  objective  Bayes. 
For  an  interesting  discussion  of  the  merits  of  this  approach,  see  Berger  (2006).  Other 
labels that have been put forward for such prior specification include reference, non-informative
and nonsubjective. Such priors may be used in situations (for example,
in  a  regulatory  setting)  in  which  one  must  be  as  “objective”  as  possible.  There  is 
a  vast  literature  on  the  construction  of  objective  Bayesian  procedures,  with  an  aim 
often  being  to  define  procedures  which  have  good  frequentist  properties. 

An analysis with a baseline prior may be the only analysis performed or,
alternatively, may provide a reference against which analyses with substantive
priors are compared. Such substantive priors constitute the
second  type  of  specification  in  which  the  incorporation  of  contextual  information 
is  required.  Once  we  have  a  candidate  substantive  prior,  it  is  often  beneficial  to 
simulate  hypothetical  data  sets  from  the  prior  and  examine  these  realizations  to  see 
if  they  conform  to  what  is  desirable.  A  popular  label  for  analyses  for  which  the 
priors  are,  at  least  in  part,  based  on  subject  matter  information  is  subjective  Bayes. 


3.4.1  Baseline  Priors 

On  first  consideration  it  would  seem  that  the  specification  of  a  baseline  prior  is 
straightforward  since  one  can  take 

$$\pi(\theta) \propto 1, \tag{3.10}$$

so that the posterior distribution is simply proportional to the likelihood $p(y \mid \theta)$.
There  are  two  major  difficulties  with  the  use  of  (3.10),  however. 
The first difficulty is that (3.10) provides an improper specification (i.e., it does
not integrate to a positive constant $< \infty$) unless the range of each element of $\theta$
is  finite.  In  some  instances  this  may  not  be  a  practical  problem  if  the  posterior 
corresponding  to  the  prior  is  proper  and  does  not  exhibit  any  aberrant  behavior 
(examples  of  such  behavior  are  presented  shortly).  A  posterior  arising  from  an 
improper  prior  may  be  justified  as  a  limiting  case  of  proper  priors,  though  some 
statisticians  are  philosophically  troubled  by  this  argument.  Another  justification  for 
an  improper  prior  is  that  such  a  choice  may  be  thought  of  as  approximating  a  prior 
that is “locally uniform” close to regions where the likelihood is non-negligible (so
that  the  likelihood  dominates)  and  decreasing  to  zero  outside  of  this  region.  Great 
care  must  be  taken  to  ensure  that  the  posterior  corresponding  to  an  improper  prior 
choice  is  proper.  For  nonlinear  models,  for  example,  improper  priors  should  never  be 
used  (as  an  example  shortly  demonstrates).  It  is  difficult  to  give  general  guidelines  as 
to  when  a  proper  posterior  will  result  from  an  improper  prior.  For  example,  improper 
priors  for  the  regression  parameters  in  a  generalized  linear  model  (which  are 
considered  in  detail  in  Chap.  6)  will  often,  but  not  always,  lead  to  a  proper  posterior. 

Example:  Binomial  Model 

Suppose $Y \mid p \sim \text{Binomial}(n, p)$, with an improper uniform prior on the logit of $p$, which we denote $\theta = \log[p/(1-p)]$. Then, $\pi(\theta) \propto 1$ implies a prior on $p$ of

$$\pi(p) \propto [p(1-p)]^{-1},$$

which is, of course, also improper.$^1$ With this prior an improper posterior results if $y = 0$ (or $y = n$) since the non-integrable spike at $p = 0$ (or $p = 1$) remains in the posterior. Note that this prior results in the MLE being recovered as the posterior mean.
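
Both claims in this example follow from a change of variables. The short derivation below uses only quantities defined above and is offered as a worked check rather than as part of the original argument.

```latex
\begin{align*}
\frac{d\theta}{dp} &= \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)},
\qquad
\pi(p) = \pi(\theta)\left|\frac{d\theta}{dp}\right| \propto [p(1-p)]^{-1}, \\
p(p \mid y) &\propto p^{y}(1-p)^{n-y} \times [p(1-p)]^{-1}
             = p^{y-1}(1-p)^{n-y-1}.
\end{align*}
```

The last expression is the kernel of a Be($y$, $n-y$) distribution, which is proper only when $0 < y < n$; in that case its mean is $y/(y + n - y) = y/n$, the MLE.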


Example:  Nonlinear  Regression  Model 

To  illustrate  the  non-propriety  in  a  nonlinear  situation,  consider  the  simple  model 


$$Y_i \mid \theta \sim_{ind} N[\exp(-\theta x_i), \sigma^2], \tag{3.11}$$

for $i = 1, \ldots, n$, with $\theta > 0$ and $\sigma^2$ assumed known. With an improper uniform prior on $\theta$, $\pi(\theta) \propto 1$, we label the resulting (unnormalized) “posterior” as

$$q(\theta \mid y) = \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n \left[ y_i - \exp(-\theta x_i) \right]^2 \right\}.$$

As $\theta \to \infty$,

$$q(\theta \mid y) \to \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n y_i^2 \right), \tag{3.12}$$

a constant, so that the posterior is improper, because the tail is non-integrable, that is,

$$\int_{\theta_c}^{\infty} q(\theta \mid y)\, d\theta = \infty$$

for all $\theta_c > 0$. Intuitively, the problem is that as $\theta \to \infty$ the corresponding nonlinear curve does not move increasingly away from the data, but rather to the asymptote $E[Y \mid \theta] = 0$. The result is that a finite sum of squares results in (3.12), even in the limit. By contrast, there are no asymptotes in a linear model, and so as the parameters increase or decrease to $\pm\infty$, the fitted line moves increasingly far from the data which results in an infinite sum of squares in the limit, in which case the likelihood, and therefore the posterior, is zero. □

$^1$This prior is sometimes known as Haldane’s prior (Haldane 1948).
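
The non-integrable tail can be checked numerically. The sketch below uses hypothetical data with positive $x_i$ and a hypothetical known $\sigma$ (none of these values come from the text); it evaluates the unnormalized posterior under the flat prior and shows that its integral keeps growing linearly as the upper limit increases.

```python
import math

# Hypothetical data and known sigma for the exponential-decay model (3.11).
x = [0.5, 1.0, 2.0, 4.0]
y = [0.55, 0.40, 0.10, 0.05]
sigma = 0.2

def q(theta):
    """Unnormalized posterior under the flat prior on theta."""
    ss = sum((yi - math.exp(-theta * xi)) ** 2 for xi, yi in zip(x, y))
    return math.exp(-ss / (2 * sigma ** 2))

# Limiting value of q as theta grows: exp(-sum(y_i^2) / (2 sigma^2)).
limit = math.exp(-sum(yi ** 2 for yi in y) / (2 * sigma ** 2))

def integral(upper, step=0.01):
    """Trapezoidal approximation to the integral of q over (0, upper)."""
    n = int(round(upper / step))
    total = 0.5 * (q(0.0) + q(upper)) + sum(q(i * step) for i in range(1, n))
    return total * step

print(q(50.0), limit)
for upper in (100, 200):
    print(upper, integral(upper))
```

Doubling the upper limit adds roughly `100 * limit` to the integral, so no finite normalizing constant exists.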


To  summarize,  it  is  ill-advised  to  think  of  improper  priors  as  a  default  choice. 
Rather,  improper  priors  should  be  used  with  care,  and  it  is  better  to  assume  that  they 
will  lead  to  problems  until  the  contrary  can  be  shown.  The  safest  strategy  is  clearly 
to  specify  proper  priors,  and  this  is  the  approach  generally  taken  in  this  book. 

The second difficulty with (3.10) is that if we reparameterize the model in terms of $\phi = g(\theta)$, where $g(\cdot)$ is a one-one mapping, then the prior for $\phi$ corresponding to (3.10) is

$$\pi(\phi) \propto \left| \frac{d\theta}{d\phi} \right|,$$

so that, unless $g$ is a linear transformation, the prior is no longer constant. We have just seen an example of this with the binomial model. As another example, consider a variance $\sigma^2$, with prior $\pi(\sigma^2) \propto 1$. This choice implies a prior for the standard deviation, $\pi(\sigma) \propto \sigma$, which is nonconstant. The problem is that we cannot
be  “flat”  on  different  nonlinear  scales.  This  issue  indicates  that  a  desirable  property 
in  constructing  baseline  priors  is  their  invariance  to  parameterization  in  order  to 
obtain  the  same  prior  regardless  of  the  starting  parameterization. 

A  number  of  methods  have  been  proposed  for  the  specification  of  baseline  or 
non-informative  priors  (we  avoid  the  latter  term  since  it  is  arguable  that  priors  are 
ever non-informative). Jeffreys (1961, Sect. 3.10) suggested the use of

$$\pi(\theta) \propto |I(\theta)|^{1/2}, \tag{3.13}$$

where $I(\theta)$ is Fisher’s expected information. This prior has the desirable property of invariance to reparameterization. The invariance holds in general but is obvious in the case of univariate $\theta$. If $\phi = g(\theta)$,

$$I_\phi(\phi) = I_\theta(\theta) \times \left| \frac{d\theta}{d\phi} \right|^2, \tag{3.14}$$
where the subscripts now emphasize the parameterization. Consequently, if we start with

$$\pi_\phi(\phi) \propto I_\phi(\phi)^{1/2},$$

this implies

$$\pi_\theta(\theta) = \pi_\phi(\phi) \times \left| \frac{d\phi}{d\theta} \right| \propto I_\phi(\phi)^{1/2} \left| \frac{d\phi}{d\theta} \right| = I_\theta(\theta)^{1/2}$$

from (3.14). Hence, prior (3.13) results if we use the prescription of Jeffreys, but begin with $\phi$. In the case of $Y \mid p \sim \text{Binomial}(n, p)$ the information is $I(p) = n/[p(1-p)]$ (Sect. 2.4.1). Therefore, Jeffreys prior is $\pi(p) \propto [p(1-p)]^{-1/2}$. This prior has the advantage of producing a proper posterior when $y = 0$ or $y = n$, a property not shared by Haldane’s prior.
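
As a worked check of the invariance in the binomial case, one can apply Jeffreys' prescription on the logit scale, $\theta = \log[p/(1-p)]$, and transform back to $p$. The derivation below is a sketch using only the information formula quoted above and $dp/d\theta = p(1-p)$.

```latex
\begin{align*}
I_\theta(\theta) &= I_p(p)\left|\frac{dp}{d\theta}\right|^2
                  = \frac{n}{p(1-p)}\,[p(1-p)]^2 = n\,p(1-p), \\
\pi_\theta(\theta) &\propto [p(1-p)]^{1/2}, \\
\pi_p(p) &= \pi_\theta(\theta)\left|\frac{d\theta}{dp}\right|
          \propto [p(1-p)]^{1/2}\,[p(1-p)]^{-1} = [p(1-p)]^{-1/2}.
\end{align*}
```

The result agrees with applying the prescription directly on the $p$ scale. The corresponding posterior is Be($y + 1/2$, $n - y + 1/2$), which is proper for every $y \in \{0, \ldots, n\}$.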

Unfortunately,  the  application  of  the  above  procedure  to  multivariate  6  can  lead 
to  posterior  distributions  that  have  undesirable  characteristics.  For  example,  in  the 
Neyman-Scott problem, the use of Jeffreys prior gives, as $n \to \infty$, a limiting
posterior  mean  that  is  inconsistent,  in  a  frequentist  sense  (see  Exercise  3.3). 

A  refinement  of  Jeffreys  approach  for  selecting  priors  on  a  more  objective  basis  is 
provided  by  reference  priors.  We  briefly  describe  this  approach  heuristically;  more 
detail  can  be  found  in  Bernardo  (1979)  and  Berger  and  Bernardo  (1992).  For  any 
prior/likelihood  distribution,  suppose  we  can  calculate  the  expected  information 
concerning  a  parameter  of  interest  that  will  be  provided  by  the  data.  The  more 
informative  the  prior,  the  less  information  the  data  will  provide.  An  infinitely  large 
sample  would  provide  all  of  the  missing  information  about  the  quantity  of  interest, 
and  the  reference  prior  is  chosen  to  maximize  this  missing  information. 


3.4.2  Substantive  Priors 

The  specification  of  substantive  priors  is  obviously  context  specific,  but  we  give  a 
number  of  general  considerations.  Specific  models  will  be  considered  in  subsequent 
chapters.  In  this  section  we  will  discuss  some  general  techniques  but  will  not 
describe  prior  elicitation  in  any  great  detail;  see  Kadane  and  Wolfson  (1998), 
O’Hagan  (1998),  and  Craig  et  al.  (1998)  and  the  ensuing  discussion  for  more  on 
this  topic  which  can  be  thought  of  as  the  measurement  of  probabilities. 

When specifying a substantive prior, it is obvious that we need a clear understanding
of the meaning of the parameters of the model for which we are specifying
priors,  and  this  can  often  be  achieved  by  reparameterization. 


Example:  Linear  Regression 

Consider the simple linear regression $E[Y \mid x] = \gamma_0 + \gamma_1 x$. Interpretation is often easier if we reparameterize as

$$E[Y \mid z] = \beta_0 + \beta_1 (z - \bar z),$$

where $z = c \times x$ and $c$ is chosen so that the units of $z$ are convenient. Under this parameterization, $\beta_0$ is the expected response at $z = \bar z$. It will often be easier to specify a prior for $\beta_0$ than for $\gamma_0$, the average response at $x = 0$, which may be meaningless. The slope parameter, $\beta_1$, is the change in expected response corresponding to a 1-unit increase in $z$ (a $1/c$-unit increase in $x$).


Example:  Exponential  Regression 

It  may  be  easier  to  specify  priors  on  observable  quantities,  before  transforming  back 
to the parameters. For the nonlinear model (3.11), we might specify a prior for the expected response at $x = \tilde x$, $\phi = \exp(-\theta \tilde x)$, to give a prior $\pi_\phi(\phi)$. The prior for $\theta$ is

$$\pi_\theta(\theta) = \pi_\phi[\exp(-\theta \tilde x)] \times \tilde x \exp(-\theta \tilde x),$$

the last term corresponding to the Jacobian of the transformation $\phi \to \theta$. As an example, one might assume a Be($a$, $b$) prior for $\phi$, with $a$ and $b$ chosen to give a 90% interval for $\phi$. □
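
The induced prior can be sketched in a few lines. The values of $a$, $b$ and $\tilde x$ below are hypothetical, and the function names are invented for this illustration; the code evaluates the beta density at $\phi = \exp(-\theta \tilde x)$, multiplies by the Jacobian, and checks numerically that the result integrates to one over $\theta > 0$.

```python
import math

a, b = 3.0, 2.0   # hypothetical Be(a, b) prior for phi = exp(-theta * x_tilde)
x_tilde = 1.5     # hypothetical covariate value at which phi is specified

def beta_pdf(phi):
    """Density of the Be(a, b) distribution at phi."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * phi ** (a - 1) * (1 - phi) ** (b - 1)

def prior_theta(theta):
    """Induced prior for theta: beta density at phi times |d phi / d theta|."""
    phi = math.exp(-theta * x_tilde)
    return beta_pdf(phi) * x_tilde * phi

# Trapezoidal check that the induced density integrates to one over theta > 0.
step, upper = 0.001, 30.0
n = round(upper / step)
total = 0.5 * (prior_theta(0.0) + prior_theta(upper))
total += sum(prior_theta(i * step) for i in range(1, n))
total *= step
print(total)
```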

While the axioms of probability are uncontroversial, the interpretation of probability has been contested for centuries. In the frequentist approach of Chap. 2, probability was defined in an objective frequentist sense. If the event $A$ is of interest and an experiment is repeated $n$ times resulting in $n_A$ occasions on which $A$ occurs, then

$$P(A) = \lim_{n \to \infty} \frac{n_A}{n}.$$

In contrast, in the subjective Bayesian worldview, probabilities are viewed as subjective
and conditional upon an individual’s experiences and knowledge, although
one  may  of  course  base  subjective  probabilities  upon  frequencies.  Cox  and  Hinkley 
(1974,  p.  53)  state,  with  reference  to  the  use  of  Bayes  theorem,  “If  the  prior 
distribution  arises  from  a  physical  random  mechanism  with  known  properties, 
this  argument  is  entirely  uncontroversial,”  but  continue,  “A  frequency  prior  is, 
however,  rarely  available.  To  apply  the  Bayesian  approach  more  generally  a  wider 
concept  of  probability  is  required  . . .  the  prior  distribution  is  taken  as  measuring  the 
investigator’s  subjective  opinion  about  the  parameter  from  evidence  other  than  the 
data  under  analysis.” 

As  alluded  to  by  this  last  quote,  an  obvious  procedure  is  to  base  the  prior 
distribution  upon  previously  collected  data.  Ideally,  preliminary  modeling  of  such 
data  should  be  carried  out  to  acknowledge  sampling  error.  If  one  believed  that  the 
data-generation  mechanism  for  both  sets  of  data  was  comparable,  then  it  would 
be  logical  to  base  the  posterior  on  the  combined  data  (and  then  once  again  one 
has  to  decide  on  how  to  pick  a  prior  distribution).  Often  such  comparability  is  not 
reasonable,  and  a  conservative  approach  is  to  take  the  prior  as  the  posterior  based 
on the additional data, but with an inflated variance, to accommodate the additional
uncertainty.  This  approach  acknowledges  nonsystematic  differences,  but  systematic 
differences  (in  particular,  biases  in  one  or  both  studies)  may  also  be  present,  and  this 
is  more  difficult  to  deal  with. 

Roughly  speaking,  so  long  as  the  prior  does  not  assign  zero  mass  to  any  region, 
the  likelihood  will  dominate  with  increasing  sample  size  (as  we  saw  in  Sect.  3.3),  so 
that  prior  choice  becomes  decreasingly  important.  A  very  difficult  problem  in  prior 
choice  is  the  specification  of  the  joint  distribution  over  multiple  parameters.  In  some 
contexts  one  may  be  able  to  parameterize  the  model  so  that  one  believes  a  priori  that 
the  components  are  independent,  but  in  general  this  will  not  be  possible. 

Due  to  the  difficulties  of  prior  specification,  a  common  approach  is  to  carry  out 
a  sensitivity  analysis  in  which  a  range  of  priors  are  considered  and  the  robustness 
of  inference  to  these  choices  is  examined.  An  alternative  is  to  model  average  across 
the  different  prior  models;  see  Sect.  3.10. 


3.4.3  Priors  on  Meaningful  Scales 

As  we  will  see  in  Chaps.  6  and  7,  loglinear  and  linear  logistic  forms  are  extremely 
useful  regression  models,  taking  the  forms 


$$\log \mu = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k$$

$$\log\left( \frac{\mu}{1-\mu} \right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k,$$

respectively, where $\mu = E[Y]$. Both forms are examples of generalized linear models (GLMs) which are discussed in some detail in Chap. 6.

Often there will be sufficient information in the data for $\beta = [\beta_0, \beta_1, \ldots, \beta_k]^T$ to be analyzed using independent normal priors with large variances (unless, for example, there are many correlated covariates). The use of an improper prior for $\beta$
will  often  lead  to  a  proper  posterior  though  care  should  be  taken.  Chapter  5  discusses 
prior  choice  for  the  linear  model  and  Chap.  6  for  GLMs,  and  Sect.  6.8  provides  an 
example  of  an  improper  posterior  that  arises  in  the  context  of  a  Poisson  model  with 
a  linear  link. 

If we wish to use informative priors for $\beta$, we may specify independent normal priors, with the parameters for each component being obtained via specification of two quantiles with associated probabilities. For loglinear and logistic models, these quantiles may be given on the exponentiated scale since these are more interpretable (as the rate ratio and odds ratio, respectively). If $\theta_1, \theta_2$ are the quantiles and $p_1, p_2$ are the associated probabilities, then the parameters of the normal prior are

$$\mu = \frac{z_1 \theta_2 - z_2 \theta_1}{z_1 - z_2} \tag{3.15}$$

$$\sigma = \frac{\theta_1 - \theta_2}{z_1 - z_2} \tag{3.16}$$
where $z_1$ and $z_2$ are the $p_1$ and $p_2$ quantiles of a standard normal random variable. For example, in an epidemiological context with a Poisson regression model, we may wish to specify a prior on a relative risk parameter, $\exp(\beta_1)$, which has a median of 1 (corresponding to no association) and a 95% point of 3 (if we think it is unlikely that the relative risk associated with a unit increase in exposure exceeds 3). If we take $\theta_1 = \log(1)$ and $\theta_2 = \log(3)$, along with $p_1 = 0.5$ and $p_2 = 0.95$, then we obtain $\beta_1 \sim N(0, 0.668^2)$. In general, less care is required in prior choice for intercepts in GLMs since they are very accurately estimated with even small amounts of data.

Fig. 3.1 The beta prior, Be(2.73, 5.67), which gives Pr(p < 0.1) = 0.05 and Pr(p < 0.6) = 0.95 (horizontal axis: probability, p)
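
The quantile-matching calculation for a normal prior can be sketched as follows. The function name `normal_from_quantiles` is invented for this illustration; the formulas inside it solve $\theta_k = \mu + \sigma z_k$, $k = 1, 2$, for $\mu$ and $\sigma$, and the standard normal quantiles come from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def normal_from_quantiles(theta1, p1, theta2, p2):
    """Mean and standard deviation of the normal with Pr(X < theta_k) = p_k."""
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    mu = (z1 * theta2 - z2 * theta1) / (z1 - z2)
    sigma = (theta1 - theta2) / (z1 - z2)
    return mu, sigma

# Relative risk with median 1 and 95% point 3, specified on the log scale.
mu, sigma = normal_from_quantiles(math.log(1), 0.5, math.log(3), 0.95)
print(mu, sigma)
```

For the relative risk example this gives a mean of zero and a standard deviation of about 0.668, in agreement with the text.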

Many  candidate  prior  distributions  contain  two  parameters.  For  example,  a  beta 
prior  may  be  used  for  a  probability  and  lognormal  or  gamma  distributions  may  be 
used  for  positive  parameters  such  as  measures  of  scale.  A  convenient  way  to  choose 
these  parameters  is  to,  as  above,  specify  two  quantiles  with  associated  probabilities 
and then solve for the two parameters. For example, suppose we wish to specify a beta prior, Be($a_1$, $a_2$), for a probability $p$, such that the $p_1$ and $p_2$ quantiles are $q_1$ and $q_2$. Then we may solve

$$[p_1 - \Pr(p < q_1 \mid a_1, a_2)]^2 + [p_2 - \Pr(p < q_2 \mid a_1, a_2)]^2 = 0$$

for $a_1, a_2$. For example, taking $p_1 = 0.05$, $p_2 = 0.95$, $q_1 = 0.1$, $q_2 = 0.6$ yields $a_1 = 2.73$, $a_2 = 5.67$, and Fig. 3.1 shows the resulting density.
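
One way to carry out this calculation with no external libraries is sketched below. It is an assumption of this sketch, not the method of the text, that the two quantile conditions are solved directly by nested bisection (the sum of squares is zero exactly when both conditions hold) and that the beta distribution function is computed by Simpson's rule, which restricts the search to $a_1, a_2 \geq 1$. All function names are hypothetical.

```python
import math

def beta_cdf(q, a, b, n=400):
    """Pr(p < q) for p ~ Be(a, b) by Simpson's rule (assumes a, b >= 1)."""
    norm = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    def pdf(x):
        return norm * x ** (a - 1) * (1 - x) ** (b - 1)
    h = q / n
    total = pdf(0.0) + pdf(q)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3

def bisect(f, lo, hi, iters=40):
    """Root of f on [lo, hi], assuming f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def solve_beta(q1, p1, q2, p2):
    """Find (a1, a2) with Pr(p < q1) = p1 and Pr(p < q2) = p2."""
    def a2_given_a1(a1):
        # Upper quantile condition: the CDF at q2 increases with a2.
        return bisect(lambda a2: p2 - beta_cdf(q2, a1, a2), 1.0, 100.0)
    # Lower quantile condition, with a2 tied to a1 through the upper one.
    a1 = bisect(lambda a: beta_cdf(q1, a, a2_given_a1(a)) - p1, 1.0, 10.0)
    return a1, a2_given_a1(a1)

a1, a2 = solve_beta(0.1, 0.05, 0.6, 0.95)
print(round(a1, 2), round(a2, 2))
```

The search brackets (1 to 10 for $a_1$, 1 to 100 for $a_2$) are arbitrary choices that happen to contain the solution for this example; other quantile specifications may need different brackets.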


3.4.4  Frequentist  Considerations 

Fig. 3.2 Mean squared error of the posterior mean estimator when $\sqrt{n}(\bar y_n - \mu) \to_d N(0, \sigma^2)$ with $\sigma^2$ known and prior $\mu \sim N(m, v)$. The dashed line represents the case with $v = 1$ and the dotted line when $v = 3$, as a function of the parameter $\mu$. The mean squared error of the sample mean is the solid horizontal line (horizontal axis: $\mu$, from $-3$ to $3$)

We briefly give a simple example to illustrate the frequentist bias-variance trade-off of prior specification, by examining the mean squared error (MSE) of a Bayesian estimator. Consider data $Y_i$, $i = 1, \ldots, n$, with $Y_i$ independently and identically distributed with $E[Y_i \mid \mu] = \mu$ and $\text{var}(Y_i \mid \mu) = \sigma^2$, with $\sigma^2$ known. The asymptotic distribution of the sample mean is

$$\sqrt{n}(\bar Y_n - \mu) \to_d N(0, \sigma^2).$$

We  treat  this  distribution  as  the  likelihood  and  examine  a  Bayesian  analysis  with 
prior 

$$\mu \sim N(m, v).$$

The posterior is

$$\mu \mid \bar Y_n \to_d N\left[ w_n \bar Y_n + (1 - w_n) m,\; w_n \frac{\sigma^2}{n} \right],$$

where

$$w_n = \frac{nv}{nv + \sigma^2}.$$

We first observe that the posterior mean estimator is consistent (since $w_n \to 1$ as $n \to \infty$), so long as $v > 0$, but the estimator has finite sample bias if $v^{-1} \neq 0$. The mean squared error of the posterior mean estimator is

$$\begin{aligned}
\text{MSE} &= \text{Variance} + \text{Bias}^2 \\
&= w_n^2 \frac{\sigma^2}{n} + [w_n \mu + (1 - w_n) m - \mu]^2 \\
&= w_n^2 \frac{\sigma^2}{n} + (1 - w_n)^2 (m - \mu)^2.
\end{aligned}$$

Figure 3.2 illustrates the MSE as a function of $\mu$ for two different prior distributions that are both centered at zero but have different variances of $v = 1, 3$. For simplicity we have chosen $\sigma^2/n = 1$ with $n = 9$ (so that the MSE of the sample mean is 1, and is indicated as the solid horizontal line). The trade-off when specifying the variance of the prior is clear; if the true $\mu$ is close to $m$, then reductions in MSE are achieved with a small $v$, though the range of $\mu$ over which an improved MSE is achieved is narrower than with the wider prior. At values of $\mu$ of $m \pm \sqrt{2v + \sigma^2/n}$, the MSE of the sample mean and Bayesian estimator are equal. The variance of the estimator is given by the lowest point of the MSE curves, and the bias dominates for large $|\mu|$.


Example:  Lung  Cancer  and  Radon 

As  an  example  of  prior  specification,  we  return  to  the  simple  model  considered 
repeatedly  in  Chap.  2  with  likelihood 

$$Y_i \mid \beta \sim_{ind} \text{Poisson}[E_i \exp(\beta_0 + \beta_1 x_i)],$$

where recall that $Y_i$ are counts of lung cancer incidence in Minnesota in 1998-2002, and $x_i$ is a measure of residential radon in county $i$, $i = 1, \ldots, n$. The obvious improper prior here is $\pi(\beta) \propto 1$ (and results in a proper posterior for this likelihood).

To specify a substantive prior, we need to have a clear interpretation of the parameters, and $\beta_0$ and $\beta_1$ are not the most straightforward quantities to contemplate. Hence, we reparameterize the model as

$$Y_i \mid \theta \sim_{ind} \text{Poisson}\left( E_i \theta_0 \theta_1^{x_i - \bar x} \right),$$

where $\theta = [\theta_0, \theta_1]^T$ so that

$$\theta_0 = E[Y/E \mid x = \bar x] = \exp(\beta_0 + \beta_1 \bar x)$$

is the expected standardized mortality ratio in an area with average radon. The standardization that leads to expected numbers $E$ implies we would expect $\theta_0$ to be centered around 1. The parameter $\theta_1 = \exp(\beta_1)$ is the relative risk associated with a one-unit increase in radon. Due to ecological bias, studies often show a negative association between lung cancer incidence and radon (and it is this ecological association we are estimating for this illustration and not the individual-level association). We take lognormal priors for $\theta_0$ and $\theta_1$ and use (3.15) and (3.16) to deduce the lognormal parameters. For $\theta_0$ we take a lognormal prior with 2.5% and 97.5% quantiles of 0.67 and 1.5 to give $\mu = 0$, $\sigma = 0.21$. For $\theta_1$ we assume the relative risk associated with a one-unit increase in radon is between 0.8 and 1.2 with probability 0.95, to give $\mu = -0.02$, $\sigma = 0.10$. We return to this example later
in  the  chapter. 
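
3.5 Model Misspecification

The behavior of Bayesian estimators under misspecification of the likelihood has received less attention than that of frequentist estimators. Recall the result concerning the behavior of the MLE $\hat\theta_n$ under model misspecification summarized in (2.27), which we reproduce here:

$$\sqrt{n}(\hat\theta_n - \theta_T) \to_d N_p[0, J^{-1} K (J^T)^{-1}],$$

where

$$J = E_T\left[ \frac{\partial^2}{\partial\theta\, \partial\theta^T} \log p(Y \mid \theta_T) \right], \qquad K = E_T\left[ \left( \frac{\partial}{\partial\theta} \log p(Y \mid \theta_T) \right) \left( \frac{\partial}{\partial\theta} \log p(Y \mid \theta_T) \right)^T \right],$$

with $\theta_T$ the true $\theta$ and $p(Y \mid \theta)$ the assumed model. Let $\tilde\theta_n = E[\theta \mid Y_n]$ be the posterior mean which we here view as a function of $Y_n = [Y_1, \ldots, Y_n]^T$. From Sect. 3.3, $\sqrt{n}(\tilde\theta_n - \hat\theta_n) \to_p 0$, and hence

$$\sqrt{n}(\tilde\theta_n - \theta_T) \to_d N_p[0, J^{-1} K (J^T)^{-1}].$$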


This  has  important  implications  since  it  shows  that,  asymptotically,  the  effect  of 
model  misspecification  on  the  posterior  mean  is  the  same  as  its  effect  on  the  MLE.  If 
the  likelihood  is  of  linear  exponential  family  form,  correct  specification  of  the  mean 
function  leads  to  consistent  estimation  of  the  parameters  in  the  mean  model  (see 
Sect.  6.5.1  for  details).  As  with  the  reported  variance  of  the  MLE,  the  spread  of  the 
posterior  distribution  could  be  completely  inappropriate,  however.  While  sandwich 
estimation  can  be  used  to  “correct”  the  variance  estimator  for  the  MLE,  there  is  no 
such  simple  solution  for  the  posterior  mean,  or  other  Bayesian  summaries. 
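
The gap between the model-based and sandwich variances can be illustrated by simulation. The sketch below rests on assumptions chosen for simplicity and not taken from the text: the data are generated from a gamma distribution with mean 1 and variance 0.25, while the assumed model is exponential with mean $\mu$, so that the MLE is the sample mean. The quantities `J` and `K` are empirical averages standing in for the expectations above.

```python
import random

random.seed(3)
n = 5000
# Hypothetical truth: gamma with shape 4 and scale 0.25 (mean 1, variance 0.25).
y = [random.gammavariate(4.0, 0.25) for _ in range(n)]

# Assumed (misspecified) model: exponential with mean mu; the MLE is the sample mean.
mu_hat = sum(y) / n

# Empirical J and K for log p(y | mu) = -log(mu) - y / mu, evaluated at the MLE.
scores = [-1 / mu_hat + yi / mu_hat ** 2 for yi in y]
hessians = [1 / mu_hat ** 2 - 2 * yi / mu_hat ** 3 for yi in y]
J = sum(hessians) / n
K = sum(s ** 2 for s in scores) / n

model_var = -1 / J / n          # inverse information, as reported by the assumed model
sandwich_var = K / J ** 2 / n   # J^{-1} K J^{-1} / n
print(mu_hat, model_var, sandwich_var)
```

Because the exponential model implies a variance of $\mu^2 = 1$ while the data have variance 0.25, the model-based variance is roughly four times too large here; a posterior based on the exponential likelihood would inherit the same miscalibrated spread.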

With  respect  to  model  misspecification,  the  emphasis  in  the  Bayesian  literature 
has  been  on  sensitivity  analyses,  or  on  embedding  a  particular  likelihood  or  prior 
choice  within  a  larger  class.  Embedding  an  initial  model  within  a  continuous  class 
is  a  conceptually  simple  approach.  For  example,  a  Poisson  model  may  be  easily 
extended  to  a  negative  binomial  model. 

A  difficulty  with  considering  model  classes  with  large  numbers  of  unknown 
parameters  is  that  uncertainty  on  parameters  of  interest  will  be  increased  if  a  simple 
model  is  closer  to  the  truth.  In  particular,  model  expansion  may  lead  to  a  decrease  in 
precision, as we now illustrate. As we have seen, as $n$ increases, the prior effect is negligible and the posterior variance is given by the inverse of Fisher’s information (Sect. 3.3). Suppose that we have $k$ parameters in an original model, and we are considering an expanded model with $p$ parameters, and let

$$I = \begin{bmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{bmatrix},$$

where $I_{11}$ is a $k \times k$ matrix corresponding to the information on the parameters of the simpler model (which includes the parameters of interest), and $I_{22}$ is the $(p-k) \times (p-k)$ information matrix concerning the additional parameters in the enlarged model. In the simpler model, the information on the parameters of interest is $I_{11}$, while for the enlarged model, it is

$$I_{11} - I_{12} I_{22}^{-1} I_{21},$$

which is never greater than $I_{11}$. This is an oversimplified discussion (as we shall
see  in  Sect.  5.9),  but  it  highlights  that  there  can  be  a  penalty  to  pay  for  specifying  an 
overly  complex  model. 
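
The loss of information is easy to see in the scalar case $k = 1$, $p = 2$. The numbers below are hypothetical and the function name is invented for this sketch; the off-diagonal term is kept small enough that the information matrix stays positive definite.

```python
def information_after_expansion(i11, i12, i22):
    """Scalar case (k = 1, p = 2) of I11 - I12 * I22^{-1} * I21, with I21 = I12."""
    return i11 - i12 * i12 / i22

i11, i22 = 4.0, 2.0   # hypothetical information values
reduced_values = []
for i12 in (0.0, 1.0, 2.0):
    reduced = information_after_expansion(i11, i12, i22)
    reduced_values.append(reduced)
    print(i12, reduced, 1 / reduced)   # information and asymptotic variance
```

With $I_{12} = 0$ nothing is lost; as the off-diagonal information grows, the information on the parameter of interest falls (here from 4 to 3.5 to 2) and the asymptotic variance rises accordingly.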


3.6  Bayesian  Model  Averaging 

If  a  discrete  number  of  models  are  considered,  then  model  averaging  provides  an 
alternative  means  of  assessing  model  uncertainty.  The  Bayesian  machinery  handles 
multiple  models  in  a  very  straightforward  fashion  since  essentially  the  unknown 
model is treated as an additional discrete parameter. Let $M_1, \ldots, M_J$ denote the $J$ models under consideration and $\theta_j$ the parameters of the $j$th model. Suppose, for illustration, there is a parameter of interest $\phi$ (which we assume is univariate) that is well defined for each of the $J$ models under consideration. The posterior for $\phi$ is a mixture over the $J$ individual model posteriors:

$$p(\phi \mid y) = \sum_{j=1}^J p(\phi \mid M_j, y) \Pr(M_j \mid y),$$

where

$$\begin{aligned}
p(\phi \mid M_j, y) &= \int p(\phi \mid \theta_j, M_j, y)\, p(\theta_j \mid M_j, y)\, d\theta_j \\
&= \frac{1}{p(y \mid M_j)} \int p(\phi \mid \theta_j, M_j, y)\, p(y \mid \theta_j, M_j)\, p(\theta_j \mid M_j)\, d\theta_j, \\
\Pr(M_j \mid y) &= \frac{p(y \mid M_j) \Pr(M_j)}{p(y)} \\
&= \frac{\int p(y \mid \theta_j, M_j)\, p(\theta_j \mid M_j)\, d\theta_j\ \Pr(M_j)}{p(y)},
\end{aligned}$$

and with $\Pr(M_j)$ the prior belief in model $j$ and $p(\theta_j \mid M_j)$ the prior on the parameters of model $M_j$. The marginal probabilities of the data under the different models are calculated as

$$p(y \mid M_j) = \int p(y \mid \theta_j, M_j)\, p(\theta_j \mid M_j)\, d\theta_j,$$

with

$$p(y) = \sum_{j=1}^J p(y \mid M_j) \Pr(M_j).$$

To summarize the posterior for $\phi$, we might report the posterior mean

$$E[\phi \mid y] = \sum_{j=1}^J E[\phi \mid y, M_j] \times \Pr(M_j \mid y),$$

which is simply the average of the posterior means across models, weighted by the posterior weight received by each model. The posterior variance is

$$\text{var}(\phi \mid y) = \sum_{j=1}^J \text{var}(\phi \mid y, M_j) \times \Pr(M_j \mid y) + \sum_{j=1}^J \{E[\phi \mid y, M_j] - E[\phi \mid y]\}^2 \times \Pr(M_j \mid y),$$

which averages the posterior variances concerning $\phi$ in each model, with the
addition of a term that accounts for between-model uncertainty in the mean.
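
The model-averaged mean and variance can be sketched as a small function. The inputs below (per-model posterior means and variances, marginal likelihoods $p(y \mid M_j)$, and prior model probabilities) are hypothetical numbers for $J = 2$ models, and the function name is invented for this illustration; in practice the marginal likelihoods are the hard part to obtain.

```python
def bma_summary(means, variances, marg_liks, prior_probs):
    """Model-averaged posterior mean and variance for a scalar parameter."""
    joint = [ml * pr for ml, pr in zip(marg_liks, prior_probs)]
    p_y = sum(joint)                        # p(y)
    post_probs = [j / p_y for j in joint]   # Pr(M_j | y)
    mean = sum(m * w for m, w in zip(means, post_probs))
    within = sum(v * w for v, w in zip(variances, post_probs))
    between = sum((m - mean) ** 2 * w for m, w in zip(means, post_probs))
    return post_probs, mean, within + between

# Hypothetical inputs for J = 2 models.
probs, mean, var = bma_summary(
    means=[1.0, 2.0], variances=[0.5, 0.25],
    marg_liks=[0.03, 0.01], prior_probs=[0.5, 0.5],
)
print(probs, mean, var)
```

Here the posterior model probabilities are 0.75 and 0.25, the averaged mean is 1.25, and the variance of 0.625 splits into a within-model part of 0.4375 and a between-model part of 0.1875.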

Although  model  averaging  is  very  appealing  in  principle,  in  practice  there  are 
many  difficult  choices,  including  the  choice  of  the  class  of  models  to  consider  and 
the  priors  over  both  the  models  and  the  parameters  of  the  models.  Summarization 
can  also  be  difficult  because  the  parameter  of  interest  may  have  different  interpre¬ 
tations  in  different  models.  For  example,  in  a  regression  setting,  suppose  we  fit  the 
single  model 

$$E[Y \mid x_1, x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

with $\beta_1$ the parameter of interest. The interpretation of $\beta_1$ is as the average change in
response corresponding to a unit increase in $x_1$, with $x_2$ held constant. If we average
over this model and the model with $x_1$ only, then the usual “$x_2$ held constant”
qualifier is not accurate, so a phrase such as “allowing for the possibility of $x_2$
in  the  model”  may  be  instead  used.  Performing  model  averaging  over  models  which 
represent  different  scientific  theories  is  also  not  appealing  if  the  search  for  a  causal 
explanation  is  sought.  If  prediction  is  the  aim,  then  model  averaging  is  much  more 
appealing  since  parameter  interpretation  is  often  irrelevant  (see  Chap.  12).  Another 
disadvantage  of  model  averaging  is  that  it  may  encourage  the  user  to  believe  they 
have  accounted  for  “all”  uncertainty  in  which  covariates  to  include  in  the  model 
which is a dangerous conclusion to draw.


3.7 Implementation

In  this  section  we  provide  an  overview  of  methods  for  evaluating  the  integrals 
required  for  performing  Bayesian  inference.  We  begin,  in  Sect.  3.7.1,  by  describing 
so-called  conjugate  situations  in  which  the  prior  and  likelihood  combination  is 
constructed  in  order  for  the  posterior  to  be  of  the  same  form  as  the  prior. 
Unfortunately,  in  a  regression  setting,  conjugate  analyses  are  rarely  available  beyond 
the  linear  model.  In  Sect.  3.7.2  the  analytical  Laplace  approximation  is  described. 
Quadrature  methods  are  considered  in  Sect.  3.7.3  before  we  turn  to  a  method  that 
combines  Laplace  and  numerical  integration  in  a  very  clever  way,  in  Sect.  3.7.4,  to 
give  a  method  known  as  the  integrated  nested  Laplace  approximation  (INLA).  More 
recently  developed  sampling-based  (Monte  Carlo)  approaches  have  transformed 
the  practical  application  of  Bayesian  methods,  and  we  therefore  describe  these 
approaches  in  some  detail.  In  Sect.  3.7.5,  importance  sampling  Monte  Carlo  is 
considered,  and  in  Sects.  3.7.6  and  3.7.7,  direct  sampling  from  the  posterior  is 
described.  MCMC  algorithms  are  particularly  important,  and  to  these  we  devote 
Sect.  3.8. 

Beyond  the  crucial  importance  of  integration  in  Bayesian  inference,  this  material 
is  also  relevant  in  a  frequentist  context.  Specifically,  in  Part  III  of  this  book,  we 
will  consider  nonlinear  and  generalized  linear  mixed  effects  models  for  which 
integration  over  the  random  effects  is  required  in  order  to  obtain  the  likelihood  for 
the  fixed  effects. 


3.7.1  Conjugacy 

So-called  conjugate  prior  distributions  allow  analytical  evaluation  of  many  of  the 
integrals  required  for  Bayesian  inference,  at  least  for  certain  convenient  parameters. 
A conjugate prior is such that $p(\theta \mid y)$ and $p(\theta)$ belong to the same family. We
assume $\dim(\theta) = p$. This definition is not adequate since it will always be true given
a suitable definition of the family of distributions. To obtain a more useful class,
we first note that if $T(Y)$ denotes a sufficient statistic for a particular likelihood
$p(\cdot \mid \theta)$, then

$$p(\theta \mid y) = p(\theta \mid t) \propto p(t \mid \theta)\, p(\theta).$$

This allows a definition of a conjugate family in terms of likelihoods that admit a
sufficient statistic of fixed dimension.

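Table 3.1 Conjugate priors and associated posterior distributions, for various likelihood choices

Prior | Likelihood | Posterior
$\theta \sim N(m, v)$ | $\bar{Y} \mid \theta \sim N(\theta, \sigma^2/n)$, $\sigma^2$ known | $\theta \mid y \sim N[w\bar{y} + (1-w)m,\ w\sigma^2/n]$ with $w = v/(v + \sigma^2/n)$
$\theta \sim \text{Be}(a, b)$ | $Y \mid \theta \sim \text{Bin}(n, \theta)$ | $\theta \mid y \sim \text{Be}(a + y,\ b + n - y)$
$\theta \sim \text{Ga}(a, b)$ | $Y \mid \theta \sim \text{Poisson}(\theta)$ | $\theta \mid y \sim \text{Ga}(a + y,\ b + 1)$
$\theta \sim \text{Ga}(a, b)$ | $Y \mid \theta \sim \text{Exp}(\theta)$ | $\theta \mid y \sim \text{Ga}(a + 1,\ b + y)$

The $p$-parameter exponential family of distributions has the form:

$$p(y_i \mid \theta) = f(y_i)\, g(\theta) \exp\left[ \lambda(\theta)^{T} u(y_i) \right],$$

where, in general, $\lambda(\theta)$ and $u(y_i)$ have the same dimension as $\theta$ and $\lambda(\theta)$ is called
the natural parameter (and in a linear exponential family, we have $u(y_i) = y_i$). For
$n$ independent and identically distributed observations from $p(\cdot \mid \theta)$,

$$p(y \mid \theta) = \left[ \prod_{i=1}^{n} f(y_i) \right] g(\theta)^{n} \exp\left[ \lambda(\theta)^{T} t(y) \right],$$

where

$$t(y) = \sum_{i=1}^{n} u(y_i).$$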


The conjugate prior density is defined as

$$p(\theta) = c(\eta, \upsilon) \times g(\theta)^{\eta} \exp\left[ \lambda(\theta)^{T} \upsilon \right],$$

where $\eta$ and $\upsilon$ are specified a priori. The resulting posterior distribution is

$$p(\theta \mid y) = c[\eta + n, \upsilon + t(y)] \times g(\theta)^{\eta + n} \exp\left\{ \lambda(\theta)^{T} [\upsilon + t(y)] \right\},$$

demonstrating conjugacy. Comparison with $p(y \mid \theta)$ indicates that $\eta$ may be viewed
as a prior sample size giving rise to a sufficient statistic $\upsilon$.

The above derivations are often not required if one wishes simply to obtain the
conjugate distribution for a given likelihood, since it can be determined quickly via
inspection of the kernel of the likelihood. The predictive distribution is often more
complex to derive but is straightforward under the above formulation. In
the case of a conjugate prior, for new observations $Z = [Z_1, \ldots, Z_m]$ arising as
an independent and identically distributed sample from $p(Z \mid \theta)$, the predictive
distribution is

$$p(z \mid y) = \left[ \prod_{i=1}^{m} f(z_i) \right] \frac{c[\eta + n, \upsilon + t(y)]}{c[\eta + n + m, \upsilon + t(y, z)]}.$$

Table  3.1  gives  the  conjugate  choices  for  a  variety  of  likelihoods. 
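
The first three rows of Table 3.1 can be sketched as small update functions. This is only an illustration: the function names are hypothetical, and we assume that Ga(a, b) is parameterized by shape a and rate b, which is the reading consistent with the Poisson row.

```python
def normal_update(m, v, ybar, sigma2, n):
    """N(m, v) prior; sample mean ybar | theta ~ N(theta, sigma2 / n), sigma2 known.

    Returns the posterior mean and variance of theta.
    """
    w = v / (v + sigma2 / n)          # weight given to the data
    return w * ybar + (1.0 - w) * m, w * sigma2 / n


def beta_binomial_update(a, b, y, n):
    """Be(a, b) prior; Y | theta ~ Bin(n, theta). Returns posterior (a, b)."""
    return a + y, b + n - y


def gamma_poisson_update(a, b, y):
    """Ga(a, b) prior (shape a, rate b); one Poisson(theta) count y."""
    return a + y, b + 1


post_mean, post_var = normal_update(m=0.0, v=1.0, ybar=2.0, sigma2=1.0, n=1)
beta_post = beta_binomial_update(a=1.0, b=1.0, y=3, n=10)
gamma_post = gamma_poisson_update(a=2.0, b=1.0, y=4)
```

In each case the update only adds data summaries to the prior parameters, which is what makes conjugate analyses cheap to embed inside direct sampling or MCMC schemes.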

Beyond  the  normal  linear  model,  the  direct  practical  use  of  conjugacy  in  a 
regression  setting  is  limited,  but  as  we  will  see  subsequently,  the  material  of  this 
section is very useful when implementing direct sampling or MCMC approaches.


Example: Binomial Likelihood

Suppose we have a single observation from a binomial distribution, $Y \mid \theta \sim \text{Binomial}(n, \theta)$:

$$p(y \mid \theta) = \binom{n}{y} \theta^{y} (1 - \theta)^{n - y}.$$

By direct inspection we recognize that the conjugate prior is a beta distribution, but
for illustration we follow the more long-winded route. In exponential family form,

$$p(y \mid \theta) = \binom{n}{y} (1 - \theta)^{n} \exp\left[ y \log\left( \frac{\theta}{1 - \theta} \right) \right],$$

or, in terms of the natural parameter $\lambda = \lambda(\theta) = \log[\theta/(1 - \theta)]$,

$$p(y \mid \lambda) = \binom{n}{y} [1 + \exp(\lambda)]^{-n} \exp(y\lambda).$$

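The conjugate prior for $\lambda$ is therefore identified as

$$\pi(\lambda) = c(\eta, \upsilon)\, [1 + \exp(\lambda)]^{-\eta} \exp(\upsilon\lambda),$$

so that the prior for $\theta$ is

$$\pi(\theta) = c(\eta, \upsilon)\, (1 - \theta)^{\eta} \exp\left[ \upsilon \log\left( \frac{\theta}{1 - \theta} \right) \right] \frac{1}{\theta(1 - \theta)} = \frac{\Gamma(\eta)}{\Gamma(\upsilon)\Gamma(\eta - \upsilon)}\, \theta^{\upsilon - 1} (1 - \theta)^{\eta - \upsilon - 1}, \qquad (3.17)$$

the Be($a$, $b$) distribution with parameters $a = \upsilon$, $b = \eta - \upsilon$. An interpretation of
these parameters is that a prior sample size $\eta = a + b$ yields the prior sufficient
statistic $\upsilon = a$. It follows immediately that the posterior is Be($a + y$, $b + n - y$).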
We write

$$E[\theta \mid y] = \frac{a + y}{a + b + n} = \frac{y}{n}\, w + \frac{a}{a + b}\, (1 - w),$$

where $w = n/(a + b + n)$, so that the posterior mean is a weighted combination of
the MLE, $\hat{\theta} = y/n$, and the prior mean. Similarly,

$$\text{mode}[\theta \mid y] = \frac{a + y - 1}{a + b + n - 2} = \frac{y}{n}\, w^{\star} + \frac{a - 1}{a + b - 2}\, (1 - w^{\star}),$$

where $w^{\star} = n/(a + b + n - 2)$, so that the posterior mode is a weighted combination
of the prior mode (if it exists) and the MLE. The choice of a uniform distribution,
$a = b = 1$, results in the posterior mode equaling the MLE, as expected in this
one-dimensional example.

The marginal distribution of the data, given likelihood and prior, is the beta-binomial distribution

$$\Pr(y) = \binom{n}{y} \times \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \times \frac{\Gamma(a + y)\Gamma(b + n - y)}{\Gamma(a + b + n)}$$

for $y = 0, \ldots, n$. If $a = b = 1$, the prior predictive is uniform over the space of
outcomes: $p(y) = (n + 1)^{-1}$ for $y = 0, 1, \ldots, n$, in line with intuition.

The mean of the prior predictive is

$$E[Y] = E_{\theta}[E(Y \mid \theta)] = n E(\theta) = n \times \frac{a}{a + b},$$

with variance

$$\text{var}(Y) = \text{var}_{\theta}[E(Y \mid \theta)] + E_{\theta}[\text{var}(Y \mid \theta)] = n E(\theta)[1 - E(\theta)] \times \frac{a + b + n}{a + b + 1},$$

illustrating the overdispersion relative to $\text{var}(Y \mid \theta) = n\theta(1 - \theta)$, if $n > 1$. If
$n = 1$, there is no overdispersion since we have a single Bernoulli random variable
for which the variance is always determined by the mean.
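
The prior predictive results above are easy to check numerically. The sketch below evaluates the beta-binomial probabilities on the log scale and compares the moments computed from the probabilities with the closed forms; the values n = 10, a = 2, b = 3 are hypothetical and chosen only for illustration.

```python
from math import comb, exp, lgamma


def log_beta(p, q):
    """Logarithm of the beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)


def beta_binomial_pmf(y, n, a, b):
    """Pr(y) = C(n, y) * B(a + y, b + n - y) / B(a, b)."""
    return comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))


n, a, b = 10, 2.0, 3.0                      # illustrative values only
pmf = [beta_binomial_pmf(y, n, a, b) for y in range(n + 1)]

# Moments computed directly from the probabilities.
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

# Closed forms from the text.
prior_mean = a / (a + b)
mean_formula = n * prior_mean
var_formula = n * prior_mean * (1 - prior_mean) * (a + b + n) / (a + b + 1)

# With a = b = 1 every outcome should have probability 1 / (n + 1).
uniform_pmf = [beta_binomial_pmf(y, n, 1.0, 1.0) for y in range(n + 1)]
```

The factor (a + b + n)/(a + b + 1) equals 1 when n = 1 and exceeds 1 otherwise, which is the overdispersion described above.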

The predictive distribution for a new trial, in which $Z = 0, 1, \ldots, m$ denotes the
number of successes and $m$ the number of trials, is

$$p(z \mid y) = \binom{m}{z} \frac{\Gamma(a + b + n)}{\Gamma(a + y)\Gamma(b + n - y)} \times \frac{\Gamma(a + y + z)\Gamma(b + n - y + m - z)}{\Gamma(a + b + n + m)},$$

which is another version of the beta-binomial distribution and is an overdispersed
binomial for which

$$E[Z \mid y] = m \times E[\theta \mid y] = m \times \frac{a + y}{a + b + n},$$

and

$$\text{var}(Z \mid y) = m \times E(\theta \mid y) \times [1 - E(\theta \mid y)] \times \frac{a + b + n + m}{a + b + n + 1}.$$

As $n \to \infty$, with $y/n$ fixed, the predictive $p(z \mid y)$ approaches the binomial distribution
Bin($m$, $y/n$). This makes sense since, under correct model specification, for
large $n$ we effectively know $\theta$, and so binomial variability is the only uncertainty
that remains.


3.7.2 Laplace Approximation

In this section let

$$I = \int_{-\infty}^{\infty} \exp[n h(\theta)]\, d\theta \qquad (3.18)$$

denote a generic integral of interest, and we suppose initially that $\theta$ is a scalar.
Depending on the form of $h(\cdot)$, (3.18) can correspond to the evaluation of a variety
of quantities of interest including $p(y)$ and posterior moments. The $n$ appearing
in (3.18) is included solely to make the asymptotic arguments more transparent.

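Let $\hat{\theta}$ denote the mode of $h(\cdot)$. We carry out a Taylor series expansion about $\hat{\theta}$,
assuming that $h(\cdot)$ is sufficiently well behaved for this operation; in particular we
assume that at least two derivatives exist. The expansion is

$$n h(\theta) = n \sum_{k=0}^{\infty} \frac{(\theta - \hat{\theta})^{k}}{k!}\, h^{(k)}(\hat{\theta}),$$

where $h^{(k)}(\hat{\theta})$ represents the $k$th derivative of $h(\cdot)$ evaluated at $\hat{\theta}$. Hence,

$$I = \int_{-\infty}^{\infty} \exp\left[ n \sum_{k=0}^{\infty} \frac{(\theta - \hat{\theta})^{k}}{k!}\, h^{(k)}(\hat{\theta}) \right] d\theta \approx \exp[n h(\hat{\theta})] \int_{-\infty}^{\infty} \exp\left[ \frac{n}{2}\, h^{(2)}(\hat{\theta}) (\theta - \hat{\theta})^{2} \right] d\theta,$$

where we have ignored cubic terms and above in the Taylor series and exploited
$h^{(1)}(\hat{\theta}) = 0$. Writing $v = -1/h^{(2)}(\hat{\theta})$ gives the estimate

$$\hat{I} = \exp[n h(\hat{\theta})] \left( \frac{2\pi v}{n} \right)^{1/2}, \qquad (3.19)$$

which is known as the Laplace approximation. The error is such that

$$\frac{\hat{I}}{I} = 1 + O(n^{-1}).$$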
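
A small numerical check of (3.19) may help fix ideas. As an illustrative assumption (this example is not from the text), take h(θ) = log θ − θ on θ > 0, for which the integral is available exactly as Γ(n + 1)/n^(n+1); the mode is at 1 with h(1) = −1 and second derivative −1, so that v = 1. Restricting the range to θ > 0 is part of the assumption; the integrand is concentrated around the mode, so the argument above still applies.

```python
from math import exp, lgamma, log, pi, sqrt


def laplace_estimate(n, h_mode, h2_mode):
    """Laplace approximation (3.19): exp[n h(mode)] * sqrt(2 pi v / n)."""
    v = -1.0 / h2_mode                     # v = -1 / h''(mode)
    return exp(n * h_mode) * sqrt(2.0 * pi * v / n)


def exact_integral(n):
    """Integral of exp[n (log t - t)] over t > 0, i.e. Gamma(n + 1) / n^(n + 1)."""
    return exp(lgamma(n + 1) - (n + 1) * log(n))


# Ratio of approximate to exact value for increasing n.
ratios = {n: laplace_estimate(n, -1.0, -1.0) / exact_integral(n)
          for n in (5, 50, 500)}
```

In this example the approximation is Stirling's formula, and the relative error shrinks roughly in proportion to 1/n, in line with the stated error order.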


Suppose we wish to evaluate the posterior expectation of a positive function of
interest $\phi(\theta)$, that is,

$$E[\phi(\theta) \mid y] = \frac{\int \exp[\log \phi(\theta) + \log p(y \mid \theta) + \log \pi(\theta) + \log(d\theta/d\phi)]\, d\theta}{\int \exp[\log p(y \mid \theta) + \log \pi(\theta)]\, d\theta} = \frac{\int \exp[n h_1(\theta)]\, d\theta}{\int \exp[n h_0(\theta)]\, d\theta},$$

where the Jacobian has been included in the numerator of the first expression. Application
of (3.19) to numerator and denominator gives

$$\hat{E}[\phi(\theta) \mid y] = \left( \frac{v_1}{v_0} \right)^{1/2} \frac{\exp[n h_1(\hat{\theta}_1)]}{\exp[n h_0(\hat{\theta}_0)]},$$

where $\hat{\theta}_j$ is the mode of $h_j(\cdot)$ and $v_j = -1/h_j^{(2)}(\hat{\theta}_j)$, $j = 0, 1$. Further,

$$\hat{E}[\phi(\theta) \mid y] = E[\phi(\theta) \mid y]\, [1 + O(n^{-2})],$$

since errors in the numerator and denominator cancel (Tierney and Kadane 1986).
If $\phi$ is not positive then a simple solution is to add a large constant to $\phi$, apply
Laplace’s method, and subtract the constant.

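Now consider multivariate $\theta$ with $\dim(\theta) = p$ and with required integral

$$I = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp[n h(\theta)]\, d\theta_1 \cdots d\theta_p.$$

The above argument may be generalized to give the Laplace approximation

$$\hat{I} = \exp[n h(\hat{\theta})] \left( \frac{2\pi}{n} \right)^{p/2} |\hat{v}|^{1/2}, \qquad (3.20)$$

where $\hat{\theta}$ is the maximum of $h(\cdot)$ and $\hat{v}$ is the inverse of the $p \times p$ matrix whose $(i, j)$th element is

$$-\left. \frac{\partial^2 h}{\partial \theta_i\, \partial \theta_j} \right|_{\hat{\theta}}.$$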

An important drawback of analytic approximations is the difficulty in performing
error assessment, so that in practice one does not know the accuracy
of  approximation.  The  evaluation  of  derivatives  can  also  be  analytically  and 
numerically  troublesome.  These  shortcomings  apart,  however,  we  will  see  that  these 
approximations  are  useful  as  components  of  other  approaches,  such  as  the  scheme 
described  in  Sect.  3.7.4,  and  for  suggesting  proposals  for  importance  sampling  and 
MCMC  algorithms. 


3.7.3 Quadrature


We consider numerical integration rules for approximating integrals of the form

$$I = \int f(t)\, dt$$

via the weighted sum

$$\hat{I} = \sum_{i=1}^{m} w_i f(t_i),$$

where the points $t_i$ and weights $w_i$ define the integration rule. So-called Gauss rules
are optimal rules (in a sense we will define shortly) that are constructed to integrate
weighted functions of polynomials accurately. Specifically, if $p(t)$ is a polynomial
of degree $2m - 1$, then the Gauss rule $(t_i, w_i)$ is such that

$$\sum_{i=1}^{m} w_i\, p(t_i) = \int w(t)\, p(t)\, dt.$$

It can be shown that no rule has this property for polynomials of degree $2m$, showing
the optimality of Gauss rules. Different classes of rule emerge for different choices
of weight function. We describe Gauss-Hermite rules that correspond to the weight
function

$$w(t) = \exp(-t^2) \qquad (3.21)$$

which is of obvious interest in a statistics context. If the integral is of the form

$$I = \int_{-\infty}^{\infty} g(t) \exp(-t^2)\, dt$$

and $g(t)$ can be well approximated by a polynomial of degree $2m - 1$, we would
expect an $m$-point Gauss-Hermite rule to be accurate.

The points of the Gauss-Hermite rule are the zeroes of the Hermite polynomials
$H_m(t)$ with weights

$$w_i = \frac{2^{m-1}\, m!\, \sqrt{\pi}}{m^2\, [H_{m-1}(t_i)]^2}.$$

In general, the points of the rule need to be located and scaled appropriately.
Suppose that $\mu$ and $\sigma$ are the approximate mean and standard deviation of $\theta$, and let
$t = (\theta - \mu)/(\sqrt{2}\sigma)$. The integral of interest is

$$I = \int f(\theta)\, d\theta = \int g(\mu + \sqrt{2}\sigma t)\, \sqrt{2}\sigma\, e^{-t^2}\, dt,$$

where $g(\theta) = f(\theta) \exp(t^2)$, and applying the transformation yields

$$\hat{I} = \sum_{i=1}^{m} w_i^{\star}\, g(t_i^{\star}),$$

where $w_i^{\star} = w_i \sqrt{2}\sigma$ and $t_i^{\star} = \mu + \sqrt{2}\sigma t_i$.

In practice $\mu$ and $\sigma$ are unknown but may be estimated at the same time as $I$ is
evaluated to give an adaptive Gauss-Hermite rule (Naylor and Smith 1982).

Suppose $\theta$ is two-dimensional, and we wish to evaluate

$$I = \int f(\theta)\, d\theta = \int \int f(\theta_1, \theta_2)\, d\theta_2\, d\theta_1 = \int f^{\star}(\theta_1)\, d\theta_1,$$

where

$$f^{\star}(\theta_1) = \int f(\theta_1, \theta_2)\, d\theta_2.$$

We form

$$\hat{I} = \sum_{i=1}^{m_1} w_i\, \hat{f}^{\star}(\theta_{1i}),$$

with

$$\hat{f}^{\star}(\theta_{1i}) = \sum_{j=1}^{m_2} u_j\, f(\theta_{1i}, \theta_{2j}),$$

to give

$$\hat{I} = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} w_i u_j\, f(\theta_{1i}, \theta_{2j}),$$

which is known as a Cartesian product rule. Such rules can provide very accurate integration
with relatively few points, but the number of points required is prohibitive
in high dimensions since for $p$ parameters and $m$ points, a total of $m^p$ points are
required. Consequently, these rules tend to be employed when $p < 10$.

In  common  with  the  Laplace  method,  quadrature  methods  do  not  provide  an 
estimate  of  the  error  of  the  approximation.  In  practice,  consistency  of  the  estimates 
across  increasing  grid  sizes  may  be  examined. 
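
To make the located-and-scaled one-dimensional rule concrete, the following sketch uses a three-point Gauss-Hermite rule. The nodes are the zeroes of H_3(t) = 8t^3 − 12t and the weights are computed from the weight formula of this section. The test integrand (the second moment of a normal density with hypothetical mean 1 and standard deviation 2) is our own choice; for it, g is a quadratic, so the rule should be exact up to rounding.

```python
from math import exp, factorial, pi, sqrt


def hermite(k, t):
    """Physicists' Hermite polynomial H_k(t), via H_{j+1} = 2 t H_j - 2 j H_{j-1}."""
    h_prev, h = 1.0, 2.0 * t
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, 2.0 * t * h - 2.0 * j * h_prev
    return h


def gh_weight(m, t):
    """Weight attached to node t in the m-point Gauss-Hermite rule."""
    return 2 ** (m - 1) * factorial(m) * sqrt(pi) / (m ** 2 * hermite(m - 1, t) ** 2)


M_POINTS = 3
NODES = [-sqrt(1.5), 0.0, sqrt(1.5)]        # zeroes of H_3(t) = 8 t^3 - 12 t
WEIGHTS = [gh_weight(M_POINTS, t) for t in NODES]


def gauss_hermite(f, mu, sigma):
    """Approximate the integral of f(theta) using location mu and scale sigma."""
    total = 0.0
    for t, w in zip(NODES, WEIGHTS):
        theta = mu + sqrt(2.0) * sigma * t  # shifted and scaled point t*_i
        g = f(theta) * exp(t * t)           # g = f / exp(-t^2)
        total += w * sqrt(2.0) * sigma * g  # scaled weight w*_i times g(t*_i)
    return total


def second_moment_integrand(theta, mu=1.0, sigma=2.0):
    """theta^2 times the N(mu, sigma^2) density (hypothetical test integrand)."""
    dens = exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))
    return theta ** 2 * dens


approx = gauss_hermite(second_moment_integrand, 1.0, 2.0)   # exact value is 1 + 4
```

In real problems the location and scale are only approximate and g is not a low-degree polynomial, which is why consistency across increasing numbers of points is worth examining.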


3.7.4 Integrated Nested Laplace Approximations

We briefly review the INLA computational approach which combines Laplace
approximations and numerical integration in a very efficient manner; see Rue
et al. (2009) for a more extensive treatment. Consider a model with parameters
$\theta_1$ that are assigned normal priors, with the remaining parameters being denoted
$\theta_2$, with $G = \dim(\theta_1)$ and $V = \dim(\theta_2)$. Assume for ease of explanation that
the normal prior is centered at zero with variance-covariance matrix $\Sigma$, $N_G(0, \Sigma)$,
where $\Sigma$ depends on elements in $\theta_2$. Many models fall into this class including
generalized linear models (Chap. 6) and generalized linear mixed models (Chap. 9).
The posterior is

$$\pi(\theta_1, \theta_2 \mid y) \propto \pi(\theta_1 \mid \theta_2)\, \pi(\theta_2) \prod_{i=1}^{n} p(y_i \mid \theta_1, \theta_2)$$

$$\propto \pi(\theta_2)\, |\Sigma(\theta_2)|^{-1/2} \exp\left[ -\frac{1}{2} \theta_1^{T} \Sigma(\theta_2)^{-1} \theta_1 + \sum_{i=1}^{n} \log p(y_i \mid \theta_1, \theta_2) \right]. \qquad (3.22)$$

Of particular interest are the posterior univariate marginal distributions $\pi(\theta_{1g} \mid y)$,
$g = 1, \ldots, G$, and $\pi(\theta_{2v} \mid y)$, $v = 1, \ldots, V$. The “normal” parameters $\theta_1$ are dealt
with by analytical approximations (as applied to the term in the exponent of (3.22),
conditional on specific values of $\theta_2$). Numerical integration techniques are applied
to $\theta_2$, so that $V$ should not be too large for accurate inference (Sect. 3.7.3). For
elements of $\theta_1$ we write

$$\pi(\theta_{1g} \mid y) = \int \pi(\theta_{1g} \mid \theta_2, y) \times \pi(\theta_2 \mid y)\, d\theta_2$$

which may be evaluated via the approximation

$$\tilde{\pi}(\theta_{1g} \mid y) = \int \tilde{\pi}(\theta_{1g} \mid \theta_2, y) \times \tilde{\pi}(\theta_2 \mid y)\, d\theta_2 \approx \sum_{k=1}^{K} \tilde{\pi}(\theta_{1g} \mid \theta_2^{(k)}, y) \times \tilde{\pi}(\theta_2^{(k)} \mid y) \times \Delta_k \qquad (3.23)$$

for a set of weights $\Delta_k$, $k = 1, \ldots, K$. Laplace or related analytical approximations
are applied to carry out the integration (over $\theta_{1g'}$, $g' \neq g$) required for evaluation of
$\tilde{\pi}(\theta_{1g} \mid \theta_2, y)$. To produce the grid of points $\{\theta_2^{(k)}, k = 1, \ldots, K\}$ over which
numerical integration is performed, the mode of $\tilde{\pi}(\theta_2 \mid y)$ is located and the
Hessian is approximated, from which the grid of points $\{\theta_2^{(k)}, k = 1, \ldots, K\}$,
with associated weights $\Delta_k$, is created and used in (3.23), as was described in
Sect. 3.7.3. The output of INLA consists of posterior marginal distributions, which
can be summarized via means, variances, and quantiles.


3.7.5  Importance  Sampling  Monte  Carlo 

The first sampling-based technique we describe directly estimates the required integrals.
To motivate importance sampling Monte Carlo, consider the one-dimensional
integral

$$I = \int_{0}^{1} f(\theta)\, d\theta = E[f(\theta)],$$

where the expectation is with respect to the uniform distribution, U(0, 1). This
formulation suggests the obvious estimator

$$\hat{I}_m = \frac{1}{m} \sum_{t=1}^{m} f(\theta^{(t)}),$$

with $\theta^{(t)} \sim_{iid} \text{U}(0, 1)$, $t = 1, \ldots, m$. By the central limit theorem (Appendix G),

$$\sqrt{m}\, (\hat{I}_m - I) \to_d N[0, \text{var}(f)],$$

where $\text{var}(f) = E[f(\theta)^2] - I^2$ and we have assumed the latter exists. The form of the
variance reveals that the efficiency of the method is determined by how variable the
function $f$ is, with respect to the uniform distribution over [0, 1]. If $f$ were constant,
we would have zero variance!

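To achieve an approximately constant function, we can trivially rewrite the
integral as

$$I = \int f(\theta)\, d\theta = \int \frac{f(\theta)}{g(\theta)}\, g(\theta)\, d\theta = E_g\left[ \frac{f(\theta)}{g(\theta)} \right], \qquad (3.24)$$

where we no longer restrict $\theta$ to lie in (0, 1). Define the estimator

$$\hat{I}_m = \frac{1}{m} \sum_{t=1}^{m} \frac{f(\theta^{(t)})}{g(\theta^{(t)})},$$

where $\theta^{(t)} \sim_{iid} g(\cdot)$, with

$$\sqrt{m}\, (\hat{I}_m - I) \to_d N[0, \text{var}(f/g)],$$

and

$$\text{var}(f/g) = E_g\left[ \frac{f(\theta)^2}{g(\theta)^2} \right] - I^2.$$

The latter may be estimated by

$$\widehat{\text{var}}(f/g) = \frac{1}{m} \sum_{t=1}^{m} \frac{f(\theta^{(t)})^2}{g(\theta^{(t)})^2} - \hat{I}_m^2.$$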

Consequently, the aim is to find a density that closely mimics $f$ (up to proportionality),
so that the Monte Carlo estimator will have low variance because samples
from important regions of the parameter space (where the function is large) are
being drawn, hence the label importance sampling Monte Carlo. A great strength of
importance sampling is that it produces not only an estimate of $I$ but a measure of
uncertainty also. Specifically, we may construct the 95% confidence interval

$$\hat{I}_m \pm 1.96 \times \sqrt{\frac{\widehat{\text{var}}(f/g)}{m}}. \qquad (3.25)$$

It may seem strange to be utilizing an asymptotic frequentist interval estimate when
evaluating an integral for Bayesian inference, but in this context the “sample size” $m$
is controlled by the user and is large so that an asymptotic interval is uncontroversial
(since a flat prior on $I$ would give the same Bayesian interval).

Efficient use of importance sampling critically depends on finding a suitable $g(\cdot)$.
From the form of $\text{var}(f/g)$, it is clear that if the support of $\theta$ is infinite, $g(\cdot)$ must
dominate in the tails; otherwise, the variance will be infinite and the estimate will
not be useful in practice (even though the estimator is unbiased). It is also desirable
to have a $g(\cdot)$ which is computationally inexpensive to sample from. Student’s t, or
mixtures of Student’s t distributions (West 1993), perhaps with iteration to tune the
proposal, are popular.
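
The estimator, its variance estimate, and the interval (3.25) can be sketched in a few lines. The integrand and proposal below are illustrative assumptions rather than choices made in the text: f is the unnormalized standard normal kernel, whose integral is the square root of 2π, and g is the standard Cauchy density, which is the Student's t density on one degree of freedom and so has much heavier tails than f.

```python
import random
from math import exp, pi, sqrt, tan


def f(theta):
    """Unnormalized standard normal kernel; its integral is sqrt(2 pi)."""
    return exp(-0.5 * theta * theta)


def g(theta):
    """Standard Cauchy density, used here as the importance proposal."""
    return 1.0 / (pi * (1.0 + theta * theta))


def draw_from_g(rng):
    """Draw from the standard Cauchy by inverting its distribution function."""
    return tan(pi * (rng.random() - 0.5))


def importance_sample(m, seed=1):
    """Return the estimate of the integral of f and the 95% interval (3.25)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(m):
        theta = draw_from_g(rng)
        ratios.append(f(theta) / g(theta))
    i_hat = sum(ratios) / m
    var_hat = sum(r * r for r in ratios) / m - i_hat ** 2
    half_width = 1.96 * sqrt(var_hat / m)
    return i_hat, (i_hat - half_width, i_hat + half_width)


i_hat, ci = importance_sample(20000)
true_value = sqrt(2.0 * pi)
```

Here the ratio f/g is bounded, so the variance is finite. Reversing the roles (a normal proposal for a Cauchy-tailed integrand) would give an unbounded ratio and an estimator with infinite variance.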


3.7.6 Direct Sampling Using Conjugacy

The emergence of methods to sample from the posterior distribution has revolutionized
the practical applicability of the Bayesian inferential approach. Such
methods  utilize  the  duality  between  samples  and  densities:  Given  a  sample,  we  can 
reconstruct  the  density  and  functions  of  interest,  and  given  an  arbitrary  density,  we 
can  almost  always  generate  a  sample,  given  the  range  of  generic  random  variate 
generators  available.  With  respect  to  the  latter,  the  ability  to  obtain  direct  samples 
from  a  distribution  decreases  as  the  dimensionality  of  the  parameter  space  increases, 
and  MCMC  methods  provide  an  attractive  alternative.  However,  as  discussed  in 
Sect.  3.8,  a  major  practical  disadvantage  to  the  use  of  MCMC  is  that  the  generated 
samples  are  dependent  which  complicates  the  calculation  of  Monte  Carlo  standard 
errors.  Automation  of  MCMC  algorithms  is  also  not  straightforward  since  an 
assessment  of  the  convergence  of  the  Markov  chain  is  required.  Further,  it  is  not 
straightforward  to  calculate  marginal  densities  such  as  (3.5)  with  MCMC.  For 
problems  with  small  numbers  of  parameters,  direct  sampling  methods  provide  a 
strong  competitor  to  MCMC,  primarily  because  independent  samples  from  the 
posterior  are  provided  and  no  assessment  of  convergence  is  required. 

Suppose we have generated independent samples $\{\theta^{(t)}, t = 1, \ldots, m\}$ from
$p(\theta \mid y)$, with $\theta^{(t)} = [\theta_1^{(t)}, \ldots, \theta_p^{(t)}]$; we describe how such samples may be used
for inference. The univariate marginal posterior $p(\theta_j \mid y)$ may be approximated
by the histogram constructed from the points $\theta_j^{(t)}$, $t = 1, \ldots, m$. Posterior means
$E[\theta_j \mid y]$ may be approximated by

$$\hat{E}[\theta_j \mid y] = \frac{1}{m} \sum_{t=1}^{m} \theta_j^{(t)},$$

with other moments following in an obvious fashion. Coverage probabilities of the
form $\Pr(a < \theta_j < b \mid y)$ are estimated by

$$\widehat{\Pr}(a < \theta_j < b \mid y) = \frac{1}{m} \sum_{t=1}^{m} I\left( a < \theta_j^{(t)} < b \right),$$

with $I(\cdot)$ representing the indicator function which is 1 if its argument is true and
0 otherwise. The central limit theorem (Appendix G) allows the accuracy of these
approximations to be simply determined since the samples are independent.

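We discuss how to estimate the standard error associated with the estimate

$$\hat{\mu}_m = \frac{1}{m} \sum_{t=1}^{m} \theta^{(t)} \qquad (3.26)$$

of $\mu = E[\theta \mid y]$. By the strong law of large numbers, $\hat{\mu}_m \to_{a.s.} \mu$ as $m \to \infty$, and
the central limit theorem (Appendix G) gives

$$\sqrt{m}\, (\hat{\mu}_m - \mu) \to_d N(0, \sigma^2)$$

where $\sigma^2 = \text{var}(\theta \mid y)$ (assuming this variance exists). The Monte Carlo standard
error is $\sigma/\sqrt{m}$, with consistent estimate of $\sigma$:

$$\hat{\sigma}_m = \sqrt{\frac{1}{m} \sum_{t=1}^{m} (\theta^{(t)} - \hat{\mu}_m)^2}.$$

By Slutsky’s theorem (Appendix G)

$$\frac{\hat{\mu}_m - \mu}{\hat{\sigma}_m / \sqrt{m}} \to_d N(0, 1)$$

as $m \to \infty$. An asymptotic 95% confidence interval for $\mu$ is therefore

$$\hat{\mu}_m \pm 1.96 \times \frac{\hat{\sigma}_m}{\sqrt{m}}.$$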
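
As a sketch of these calculations, the code below draws independent samples from a beta posterior and reports the Monte Carlo estimate of the posterior mean, its standard error, and an interval probability. The prior a = b = 1, the data y = 3, n = 10, and the interval (0.2, 0.5) are hypothetical values chosen for illustration.

```python
import random
from math import sqrt


def posterior_summary(a, b, y, n, m=20000, seed=2):
    """Monte Carlo summaries from m independent Be(a + y, b + n - y) draws."""
    rng = random.Random(seed)
    draws = [rng.betavariate(a + y, b + n - y) for _ in range(m)]
    mu_hat = sum(draws) / m                                    # estimate (3.26)
    sigma_hat = sqrt(sum((d - mu_hat) ** 2 for d in draws) / m)
    mc_se = sigma_hat / sqrt(m)                                # Monte Carlo standard error
    prob = sum(1 for d in draws if 0.2 < d < 0.5) / m          # Pr(0.2 < theta < 0.5 | y)
    return mu_hat, mc_se, prob


mu_hat, mc_se, prob = posterior_summary(a=1.0, b=1.0, y=3, n=10)
exact_mean = (1.0 + 3) / (1.0 + 1.0 + 10)      # (a + y) / (a + b + n)
interval = (mu_hat - 1.96 * mc_se, mu_hat + 1.96 * mc_se)
```

Because the posterior mean is known exactly in this conjugate setting, the example doubles as a check that the Monte Carlo error is of the size the standard error suggests.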


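We may also wish to obtain standard errors for functions that are not simple expectations.
For example, consider the posterior variance of a univariate parameter $\theta$:

$$\sigma^2 = \text{var}(\theta \mid y) = E[(\theta - \mu)^2 \mid y],$$

where $\mu = E[\theta \mid y]$. An obvious estimator is

$$\hat{\sigma}_m^2 = \frac{1}{m} \sum_{t=1}^{m} (\theta^{(t)} - \hat{\mu}_m)^2,$$

where $\hat{\mu}_m$ is given by (3.26). Now,

$$\sqrt{m} \left( \begin{bmatrix} \hat{\mu}_m \\ \hat{\sigma}_m^2 \end{bmatrix} - \begin{bmatrix} \mu \\ \sigma^2 \end{bmatrix} \right) \to_d N_2\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma^2 & \mu_3 \\ \mu_3 & \mu_4 - \sigma^4 \end{bmatrix} \right),$$

where $\mu_j = E[(\theta - \mu)^j \mid y]$ is the $j$th central moment, $j = 3, 4$ (where we assume
that these quantities exist). The standard error of $\hat{\sigma}_m^2$ is estimated by

$$\sqrt{\frac{\hat{\mu}_4 - \hat{\sigma}_m^4}{m}},$$

where $\hat{\mu}_4 = \frac{1}{m} \sum_{t=1}^{m} (\theta^{(t)} - \hat{\mu}_m)^4$, which can, unfortunately, be highly unstable.

Therefore, accurate interval estimates for $\sigma^2$ require larger sample sizes than are
needed for accurate estimates for $\mu$.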

Once samples from $p(\theta \mid y)$ are obtained, it is straightforward to convert to
samples for a parameter of interest $g(\theta)$ via $g(\theta^{(t)})$. This property is important
in a conjugate setting since although we have analytical tractability for one set of
parameters, we may be interested in functions that are not so convenient.
For example, with likelihood $Y \mid \theta \sim \text{Binomial}(n, \theta)$ and prior $\theta \sim \text{Be}(a, b)$, we
know that $\theta \mid y \sim \text{Be}(a + y, b + n - y)$. However, suppose we are interested in the
odds $g(\theta) = \theta/(1 - \theta)$. Given samples $\theta^{(t)}$ from the beta posterior, we can simply
form $g(\theta^{(t)}) = \theta^{(t)}/(1 - \theta^{(t)})$, $t = 1, \ldots, m$. As an aside, in this setting, for a
Bayesian analysis with a proper prior, the realizations $Y = 0$ or $Y = n$ do not cause
problems, in contrast to the frequentist case in which the MLE for $g(\theta)$ is undefined.
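
The transformation to the odds scale can be sketched as follows, using the boundary outcome y = 0 to illustrate the aside above. The prior a = b = 1 and n = 10 are hypothetical. For reference, if θ follows a Be(α, β) distribution with β > 1, the odds θ/(1 − θ) have mean α/(β − 1), which is 0.1 here.

```python
import random


def posterior_odds_draws(a, b, y, n, m=10000, seed=3):
    """Draw theta from Be(a + y, b + n - y) and transform each draw to the odds."""
    rng = random.Random(seed)
    thetas = (rng.betavariate(a + y, b + n - y) for _ in range(m))
    return [theta / (1.0 - theta) for theta in thetas]


# Boundary outcome y = 0: the posterior Be(1, 11) is still a proper distribution.
odds = posterior_odds_draws(a=1.0, b=1.0, y=0, n=10)
odds_mean = sum(odds) / len(odds)
```

No special handling is needed for y = 0 because the proper prior keeps the posterior away from a point mass at the boundary.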


3.7.7 Direct Sampling Using the Rejection Algorithm

The rejection algorithm is a generic and widely applicable method for generating
samples from arbitrary probability distributions.

Theorem (Rejection Sampling).

Suppose we wish to sample from the distribution

$$f(x) = \frac{f^{\star}(x)}{\int f^{\star}(x)\, dx},$$

and we have a proposal distribution $g(\cdot)$ for which

$$M = \sup_{x} \frac{f^{\star}(x)}{g(x)} < \infty.$$

Then the algorithm:

1. Generate $U \sim \text{U}(0, 1)$ and, independently, $X \sim g(\cdot)$.

2. Accept $X$ if

$$U < \frac{f^{\star}(X)}{M g(X)},$$

otherwise return to 1,

produces accepted points with distribution $f(x)$, and the acceptance probability is

$$p_a = \frac{\int f^{\star}(x)\, dx}{M}.$$

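Proof. The following is based on Ripley (1987). We have

$$\begin{aligned}
\Pr(X \le x \cap \text{acceptance}) &= \Pr(X \le x) \Pr(\text{acceptance} \mid X \le x) \\
&= \int_{-\infty}^{x} g(y) \Pr(\text{acceptance} \mid y)\, dy \\
&= \int_{-\infty}^{x} g(y)\, \frac{f^{\star}(y)}{M g(y)}\, dy = \int_{-\infty}^{x} \frac{f^{\star}(y)}{M}\, dy.
\end{aligned}$$

The probability of acceptance is

$$\Pr(\text{acceptance}) = \int_{-\infty}^{\infty} \frac{f^{\star}(y)}{M}\, dy = p_a.$$

The number of iterations until accepting a point is a geometric random variable with
probability $p_a$. The expected number of iterations until acceptance is $p_a^{-1}$. It follows
that

$$\begin{aligned}
\Pr(X \le x \mid \text{acceptance}) &= \sum_{i=1}^{\infty} \Pr(X \le x \cap \text{acceptance on the } i\text{th trial}) \\
&= \sum_{i=1}^{\infty} (1 - p_a)^{i-1} \int_{-\infty}^{x} \frac{f^{\star}(y)}{M}\, dy \\
&= \frac{1}{p_a} \int_{-\infty}^{x} \frac{f^{\star}(y)}{M}\, dy = \frac{\int_{-\infty}^{x} f^{\star}(y)\, dy}{\int_{-\infty}^{\infty} f^{\star}(y)\, dy},
\end{aligned}$$

as required. □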
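
The theorem can be illustrated with a deliberately simple target. As an assumption for this sketch (the example is not from the text), take the unnormalized density x(1 − x) on [0, 1], which is the kernel of a Be(2, 2) distribution, with a U(0, 1) proposal. Then M = 1/4, and since the kernel integrates to 1/6 the acceptance probability should be 2/3.

```python
import random


def f_star(x):
    """Unnormalized target: the Be(2, 2) kernel on [0, 1]."""
    return x * (1.0 - x)


# The proposal g is the U(0, 1) density (equal to 1 on [0, 1]), so
# M = sup f_star(x) / g(x) = 0.25, attained at x = 0.5.
M = 0.25


def rejection_sample(m, seed=4):
    """Return m accepted points and the empirical acceptance rate."""
    rng = random.Random(seed)
    accepted, trials = [], 0
    while len(accepted) < m:
        trials += 1
        x = rng.random()                   # step 1: X ~ g
        u = rng.random()                   # step 1: U ~ U(0, 1), independently
        if u < f_star(x) / (M * 1.0):      # step 2: accept if U < f*(X) / (M g(X))
            accepted.append(x)
    return accepted, m / trials


samples, acc_rate = rejection_sample(20000)
sample_mean = sum(samples) / len(samples)  # Be(2, 2) has mean 0.5
```

The normalizing constant of the target is never used by the sampler, only the bound M, which is what makes the method convenient for posterior distributions.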

We describe a rejection algorithm that is convenient for generating samples from
the posterior (Smith and Gelfand 1992). Let $\theta$ denote the unknown parameters, and
assume that we can evaluate the maximized likelihood

$$M = \sup_{\theta} p(y \mid \theta) = p(y \mid \hat{\theta}),$$

where $\hat{\theta}$ is the MLE. The algorithm then proceeds as follows:

1. Generate $U \sim \text{U}(0, 1)$ and, independently, sample from the prior, $\theta \sim \pi(\theta)$.

2. Accept $\theta$ if

$$U < \frac{p(y \mid \theta)}{M},$$

otherwise return to 1.

The probability that a point is accepted is

$$p_a = \frac{\int p(y \mid \theta)\, \pi(\theta)\, d\theta}{M} = \frac{p(y)}{M}.$$

This algorithm can be very easy to implement since finding the MLE can often be
carried out routinely. We need then only generate points from the prior and evaluate
the likelihood at these points. Rejection sampling from the prior is very intuitive;
the prior supplies the points which are then “filtered out” via the likelihood.

The empirical acceptance rate can be used to derive the normalizing constant as

$$\hat{p}(y) = M \times \hat{p}_a \qquad (3.28)$$

which may be useful for model assessment/selection (Sect. 3.10). If we desire $m$
samples from the posterior, the number of generations required from the prior $\pi(\cdot)$ is
$m + m^{\star}$ (where $m^{\star}$ is the number of rejected points), and $m^{\star}$ is a negative binomial
random variable (Appendix D). The MLE of $p_a$ is $\hat{p}_a = m/(m + m^{\star})$.

An alternative importance sampling estimator of the normalizing constant that is
more efficient than (3.28) is

$$\hat{p}(y) = \frac{1}{m + m^{\star}} \sum_{t=1}^{m + m^{\star}} p(y \mid \theta^{(t)}), \qquad (3.29)$$

where $\theta^{(t)} \sim_{iid} \pi(\cdot)$, $t = 1, \ldots, m + m^{\star}$. Notice that there is no rejection
of points associated with this calculation so that all $m + m^{\star}$ prior points are
used. Although (3.29) is the more efficient estimator, (3.28) provides an alternative
estimator as a by-product that is useful for code checking. The estimator (3.28)
assumes that all normalizing constants are included in $M$. If the maximization has
been carried out with respect to $M^{\star} = p^{\star}(y \mid \hat{\theta})$ where $p^{\star}(y \mid \theta) = p(y \mid \theta)/c$,
then we must instead use the estimate

$$\hat{p}(y) = c \times M^{\star} \times \hat{p}_a. \qquad (3.30)$$

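Posterior moments can be estimated directly as averages of the accepted points,
or we may implement importance sampling estimators that use all points generated
from the prior. For example, the posterior mean

$$E[\theta \mid y] = \frac{\int \theta\, p(y \mid \theta)\, \pi(\theta)\, d\theta}{\int p(y \mid \theta)\, \pi(\theta)\, d\theta} = \frac{E[\theta\, p(y \mid \theta)]}{E[p(y \mid \theta)]}$$

may be estimated by

$$\hat{E}[\theta \mid y] = \frac{\frac{1}{m + m^{\star}} \sum_{t=1}^{m + m^{\star}} \theta^{(t)}\, p(y \mid \theta^{(t)})}{\frac{1}{m + m^{\star}} \sum_{t=1}^{m + m^{\star}} p(y \mid \theta^{(t)})},$$

where $\theta^{(t)} \sim_{iid} \pi(\cdot)$, $t = 1, \ldots, m + m^{\star}$.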
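
The following sketch applies rejection sampling from the prior to a case where the answers are known, so that (3.28), (3.29), and the two posterior mean estimators can be compared. We assume, purely for illustration, a binomial likelihood with y = 3 successes in n = 10 trials and a uniform Be(1, 1) prior, for which p(y) = 1/(n + 1) and the posterior mean is (1 + y)/(2 + n).

```python
import random
from math import comb


def likelihood(theta, y, n):
    """Binomial likelihood p(y | theta), including the binomial coefficient."""
    return comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)


def rejection_from_prior(y, n, m, seed=5):
    """Sample the posterior by rejection from a U(0, 1) prior until m acceptances."""
    rng = random.Random(seed)
    big_m = likelihood(y / n, y, n)          # maximized likelihood, at the MLE y / n
    accepted, all_draws = [], []
    while len(accepted) < m:
        theta = rng.random()                 # draw from the prior
        u = rng.random()
        all_draws.append(theta)
        if u < likelihood(theta, y, n) / big_m:
            accepted.append(theta)
    p_a_hat = m / len(all_draws)             # m / (m + m*)
    evidence_rejection = big_m * p_a_hat     # estimator (3.28)
    liks = [likelihood(t, y, n) for t in all_draws]
    evidence_average = sum(liks) / len(liks) # estimator (3.29)
    mean_accepted = sum(accepted) / m        # average of the accepted points
    mean_weighted = sum(t * l for t, l in zip(all_draws, liks)) / sum(liks)
    return evidence_rejection, evidence_average, mean_accepted, mean_weighted


ev_rej, ev_avg, mean_acc, mean_wt = rejection_from_prior(y=3, n=10, m=5000)
```

Agreement between the two estimates of p(y), and between the two estimates of the posterior mean, is the kind of by-product code check mentioned above.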

Clearly  we  need  a  proper  prior  distribution  to  implement  the  above  algorithm. 
The  efficiency  of  the  algorithm  will  depend  on  the  correspondence  between  the 
likelihood  and  the  prior,  as  measured  through  p(y).  For  large  n,  the  algorithm  will 
become  less  efficient  since  the  likelihood  becomes  increasingly  concentrated,  and 
so  prior  points  are  less  likely  to  be  accepted  (which  is  another  manifestation  of  the 
prior  becoming  less  important  with  increasing  sample  size,  Sect.  3.3). 

The  rejection  algorithm  that  samples  from  the  prior  does  not  need  the  functional 
form  of  the  prior  to  be  available.  As  an  example,  Wakefield  (1996)  used  a  predictive 
distribution  from  a  Bayesian  analysis  as  the  prior  for  the  analysis  of  a  separate 
dataset;  samples  from  the  predictive  distribution  could  be  simply  generated,  even 
though  no  closed  form  was  available  for  this  distribution. 


Example:  Poisson  Likelihood,  Lognormal  Prior 

We illustrate some of the techniques described in the previous sections using a
Poisson  likelihood  with  data  from  a  geographical  cluster  investigation  carried  out 
in  the  United  Kingdom  (Black  1984).  The  Sellafield  nuclear  site  is  located  in  the 
northwest  of  England  on  the  coast  of  West  Cumbria.  Initially,  the  site  produced 
plutonium  for  defense  purposes  and  subsequently  carried  out  the  reprocessing 
of  spent  fuel  from  nuclear  power  stations  in  Britain  and  abroad  and  stored  and 
discharged  to  sea  low-level  radioactive  waste.  Seascale  is  a  village  3  km  to  the  south 
of  Sellafield  and  had  y  =  4  cases  of  lymphoid  malignancy  among  0-14  year  olds 
during  1968-1982,  compared  with  E  =  0.25  expected  cases  (based  on  the  number 
of  children  in  the  region  and  registration  rates  for  the  overall  northern  region  of 
England).  A  question  here  is  whether  such  a  large  number  of  cases  could  have 
reasonably  occurred  by  chance.  There  is  substantial  information  available  on  the 
incidence  of  childhood  leukemia  across  the  United  Kingdom  as  a  whole. 

We assume the model $Y \mid \theta \sim \text{Poisson}[E\exp(\theta)]$, where $\theta$ is the log relative
risk (the ratio of the risk in the study region to that in the northern region), the
MLE of which is $\hat\theta = \log(16) = 2.77$ with asymptotic variance $1/y = 0.25$ (standard error 0.5).
We assume an $N(\mu, \sigma^2)$ normal prior for $\theta$, which is equivalent to a lognormal
prior $\text{LogNorm}(\mu, \sigma^2)$ for $\exp(\theta)$. To choose the prior parameters, we assume, for
illustration, that the median relative risk is 1 and the 90% point of the prior is 10,
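which leads, from (3.15) and (3.16), to $\mu = 0$ and $\sigma^2 = 1.38^2$.

We will estimate

$$
I_r = \int_{-\infty}^{\infty} \theta^r \Pr(y \mid \theta)\pi(\theta)\, d\theta
    = \frac{E^y (2\pi\sigma^2)^{-1/2}}{y!} \int_{-\infty}^{\infty} \theta^r \exp\left[-E e^{\theta} + \theta y - \frac{(\theta - \mu)^2}{2\sigma^2}\right] d\theta
    = \frac{E^y (2\pi\sigma^2)^{-1/2}}{y!} \int_{-\infty}^{\infty} \exp[h_r(\theta)]\, d\theta,
$$

where $h_r(\theta) = r\log\theta - E e^{\theta} + \theta y - (\theta - \mu)^2/(2\sigma^2)$,
for $r = 0, 1, 2$, to give the normalizing constant and posterior mean and variance as

$$
p(y) = I_0, \qquad E[\theta \mid y] = \frac{I_1}{I_0}, \qquad \text{var}(\theta \mid y) = \frac{I_2}{I_0} - \left(\frac{I_1}{I_0}\right)^2 .
$$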


We  choose  to  calculate  the  posterior  variance  not  because  it  is  a  quantity  of  particular 
interest  but  because  it  provides  a  summary  that  is  not  particularly  easy  to  estimate 
and  so  reveals  some  of  the  complications  of  the  various  methods. 


To apply the Laplace method, we first give the first and second derivatives of
$h_r(\theta)$:

$$
\frac{\partial h_r}{\partial \theta} = \frac{r}{\theta} - E\exp(\theta) + y - \frac{\theta - \mu}{\sigma^2}, \qquad
\frac{\partial^2 h_r}{\partial \theta^2} = -\frac{r}{\theta^2} - E\exp(\theta) - \frac{1}{\sigma^2},
$$

for $r = 0, 1, 2$. The estimates based on the Laplace approximation are shown in
Table 3.2. The normalizing constant and posterior mean are accurately estimated,
but the posterior variance is underestimated for these data. We implemented
Gauss-Hermite rules using m = 5, 10, 15, 20 points, with the grid centered and
scaled by the Laplace approximations of the mean and variance of the posterior.
Table 3.2 shows that $\Pr(y)$ and $E[\theta \mid y]$ are well estimated across all grid
sizes, while there is more variability in the estimate of $\text{var}(\theta \mid y)$, though it
is more accurately estimated than with the Laplace approximation.
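The Laplace and Gauss-Hermite calculations can be sketched as follows for y = 4, E = 0.25, $\mu = 0$ and $\sigma = 1.38$. This is an illustration under stated assumptions rather than the code behind Table 3.2: the grid is centered at the posterior mode and scaled by the curvature there (one reading of "centered and scaled by the Laplace approximations"), and the function names are introduced here.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss

y, E = 4, 0.25            # observed and expected counts from the example
mu, sigma = 0.0, 1.38     # prior mean and standard deviation of theta

def log_f0(theta):
    """log of Pr(y | theta) * pi(theta), the integrand of I_0."""
    log_lik = y * (math.log(E) + theta) - E * np.exp(theta) - math.lgamma(y + 1)
    log_prior = -0.5 * math.log(2 * math.pi * sigma**2) - (theta - mu) ** 2 / (2 * sigma**2)
    return log_lik + log_prior

# Newton-Raphson for the posterior mode, using the derivatives of h_0.
mode = math.log(y / E)
for _ in range(50):
    grad = -E * math.exp(mode) + y - (mode - mu) / sigma**2
    hess = -E * math.exp(mode) - 1 / sigma**2
    mode -= grad / hess
hess = -E * math.exp(mode) - 1 / sigma**2
scale = math.sqrt(-1 / hess)

# Laplace approximation to I_0 = p(y).
laplace_py = math.exp(log_f0(mode)) * math.sqrt(2 * math.pi) * scale

def gauss_hermite_moments(m):
    """Approximate p(y), E[theta | y], var(theta | y) with an m-point rule."""
    x, w = hermgauss(m)                        # nodes and weights for weight exp(-x^2)
    theta = mode + math.sqrt(2) * scale * x    # shift and scale the grid
    vals = np.exp(log_f0(theta) + x**2) * w * math.sqrt(2) * scale
    I0, I1, I2 = (np.sum(theta**r * vals) for r in range(3))
    return I0, I1 / I0, I2 / I0 - (I1 / I0) ** 2

p_y, post_mean, post_var = gauss_hermite_moments(20)
```

Working with the log of the integrand avoids underflow at grid points far from the mode.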

Table 3.2 Laplace, Gauss-Hermite, and Monte Carlo approximations for the Poisson
lognormal model with an observed count of y = 4 and an expected count of E = 0.25

Method                  Pr(y) (x 10^2)      E[theta | y]        var(theta | y)
Truth                   1.37                2.27                0.329
Laplace                 1.35                2.29                0.304
Gauss-Hermite m = 5     1.36                2.27                0.328
Gauss-Hermite m = 10    1.37                2.27                0.331
Gauss-Hermite m = 15    1.37                2.27                0.331
Gauss-Hermite m = 20    1.37                2.27                0.331
Importance sampling     1.37 [1.35, 1.38]   2.27 [2.24, 2.29]   0.336 [0.310, 0.362]
Rejection algorithm     1.37                2.27 [2.25, 2.28]   0.332 [0.319, 0.346]
Metropolis-Hastings     -                   2.27 [2.22, 2.32]   0.328 [0.294, 0.361]

The importance sampling and rejection algorithms are based on samples of size m = 5,000.
The Metropolis-Hastings algorithm was run for 51,000 iterations, with the first 1,000
discarded as burn-in. 95% confidence intervals for the relevant estimates are displayed (where
available) in square brackets in the last three lines of the table.

We now turn to importance sampling. We have

$$
I_r = \int f_r(\theta)\, d\theta = \int \frac{f_r(\theta)}{g(\theta)}\, g(\theta)\, d\theta = E_g\left[\frac{f_r(\theta)}{g(\theta)}\right]
$$

with $f_r(\theta) = \theta^r \Pr(y \mid \theta)\pi(\theta)$.

We take as proposal, $g(\cdot)$, a normal distribution scaled via the Laplace estimates
of location and scale. Table 3.2 shows estimates resulting from the use of m = 5,000
points and the estimator

$$
\hat{I}_r = \frac{1}{m} \sum_{t=1}^{m} \frac{f_r(\theta^{(t)})}{g(\theta^{(t)})}
$$

where $\theta^{(t)}$ are independent samples from the normal proposal. The variance of the
estimator is

$$
\text{var}(\hat{I}_r) = \frac{1}{m}\, \text{var}_g\left[\frac{f_r(\theta)}{g(\theta)}\right].
$$

The delta method can be used to produce measures of accuracy for the posterior
mean and variance, though these measures are a little cumbersome. The variance of
the normalizing constant is

$$
\text{var}[\widehat{\Pr}(y)] = \text{var}(\hat{I}_0).
$$

To evaluate the variances of the posterior mean and posterior variance estimates we
need the multivariate delta method, applied to the ratios $\hat{I}_1/\hat{I}_0$ and
$\hat{I}_2/\hat{I}_0$. We must also include covariance terms if the same samples are
used to evaluate all three integrals.

Fig. 3.3 Histogram representations of (a) $p(\theta \mid y)$, with the prior drawn as a
solid curve, and (b) $p(e^{\theta} \mid y)$

Using the resulting variance expressions we obtain the interval estimates displayed
in Table 3.2. The estimates of each of the three summaries are accurate, though the
interval estimate for the posterior variance is quite wide, because of the inherent
instability associated with estimating the standard error.
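The importance sampling estimates and a delta-method standard error for the posterior mean can be sketched as below. Several choices are assumptions of this sketch and are not stated in the text: the proposal is centered at the Laplace mean from Table 3.2 (2.29), its standard deviation is twice the Laplace value $\sqrt{0.304}$ so that the proposal has heavier tails than a direct Laplace scaling, and the standard error uses the usual first-order linearization of the ratio $\hat{I}_1/\hat{I}_0$, which accounts for the covariance arising from reusing the same samples.

```python
import math
import numpy as np

y, E = 4, 0.25                    # counts from the example
mu, sigma = 0.0, 1.38             # prior parameters for theta

def log_f0(theta):
    """log of f_0(theta) = Pr(y | theta) * pi(theta)."""
    log_lik = y * (math.log(E) + theta) - E * np.exp(theta) - math.lgamma(y + 1)
    log_prior = -0.5 * math.log(2 * math.pi * sigma**2) - (theta - mu) ** 2 / (2 * sigma**2)
    return log_lik + log_prior

def log_normal_pdf(x, loc, scale):
    return -0.5 * math.log(2 * math.pi * scale**2) - (x - loc) ** 2 / (2 * scale**2)

rng = np.random.default_rng(2)
m = 5_000
loc, scale = 2.29, 2 * math.sqrt(0.304)      # assumed proposal location and scale
theta = rng.normal(loc, scale, size=m)
w = np.exp(log_f0(theta) - log_normal_pdf(theta, loc, scale))   # f_0 / g

I0_hat = w.mean()                            # estimate of p(y)
se_I0 = w.std(ddof=1) / math.sqrt(m)         # square root of the estimated var(I0_hat)
post_mean = np.sum(w * theta) / np.sum(w)    # I1_hat / I0_hat
# Delta method for the ratio: linearize I1 / I0 around the estimated values.
se_mean = math.sqrt(np.sum((w * (theta - post_mean)) ** 2)) / np.sum(w)
```

An approximate 95% interval for the posterior mean is then `post_mean` plus or minus 1.96 times `se_mean`.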

Finally we implement a rejection algorithm, sampling from the prior distribution
and estimating $\Pr(y)$ using the importance sampling estimator, (3.29). The mean
and variance of the samples were used to evaluate $E[\theta \mid y]$ and $\text{var}(\theta \mid y)$, with
the standard error of the latter based on (3.27). The acceptance probability was
0.07, the small value being explained by the discrepancy between the prior and the
likelihood. This is illustrated in Fig. 3.3(a), which gives a histogram representation,
based on 5,000 points, of $p(\theta \mid y)$, along with the prior drawn as a solid curve.
Panel (b) displays the marginal posterior distribution of the relative risk, $p(e^{\theta} \mid y)$,
which is of more substantive interest, and is simply produced via exponentiation of
the $\theta$ samples. The rejection estimates in Table 3.2 have relatively narrow interval
estimates.
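A minimal sketch of the rejection algorithm for this example is given below. It assumes the usual form of the algorithm, in which a prior draw is accepted with probability $p(y \mid \theta)/M$ with $M$ the likelihood evaluated at the MLE, and it estimates $\Pr(y)$ by averaging the likelihood over all prior draws; both are assumptions about details defined outside this passage. The number of prior draws is chosen for illustration.

```python
import math
import numpy as np

y, E = 4, 0.25
mu, sigma = 0.0, 1.38

def poisson_lik(theta):
    """Pr(y | theta) for the Poisson model with mean E * exp(theta)."""
    lam = E * np.exp(theta)
    return np.exp(-lam + y * np.log(lam) - math.lgamma(y + 1))

rng = np.random.default_rng(3)
n_prior = 200_000
theta = rng.normal(mu, sigma, size=n_prior)      # draws from the prior
lik = poisson_lik(theta)
M = poisson_lik(math.log(y / E))                 # likelihood maximized at the MLE
accept = rng.uniform(size=n_prior) < lik / M     # accept with probability p(y | theta) / M
accepted = theta[accept]

acceptance_rate = accept.mean()
p_y_hat = lik.mean()                             # uses all prior draws
post_mean, post_var = accepted.mean(), accepted.var(ddof=1)
relative_risk = np.exp(accepted)                 # samples from p(exp(theta) | y)
```

Under these assumptions the acceptance rate estimates $p(y)/M$, which is why a prior that is poorly matched to the likelihood gives few accepted points.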


3.8  Markov  Chain  Monte  Carlo 

3.8.1  Markov  Chains  for  Exploring  Posterior  Distributions 

The  fundamental  idea  behind  MCMC  is  to  construct  a  Markov  chain  over  the 
parameter  space,  with  invariant  distribution  the  posterior  distribution  of  interest. 
Specifically, consider a random variable $x$ with support $\mathbb{R}^p$ and density $\pi(\cdot)$. We
give a short summary of the essence of discrete time Markov chain theory.

A sequence of random variables $X^{(0)}, X^{(1)}, \dots$ is called a Markov chain on a
state space $\mathbb{R}^p$ if for all $t$ and for all measurable sets $A$:

$$
\Pr(X^{(t+1)} \in A \mid X^{(t)}, X^{(t-1)}, \dots, X^{(0)}) = \Pr(X^{(t+1)} \in A \mid X^{(t)})
$$

so that the probability of moving to any set $A$ at time $t + 1$ only depends on where
we are at time $t$. Furthermore, for a homogeneous Markov chain,

$$
\Pr(X^{(t+1)} \in A \mid X^{(t)}) = \Pr(X^{(1)} \in A \mid X^{(0)}).
$$

If there exists $p(x, y)$ such that

$$
\Pr(X^{(1)} \in A \mid x) = \int_A p(x, y)\, dy,
$$

then $p(x, y)$ is called the transition kernel density. A probability distribution $\pi(\cdot)$
on $\mathbb{R}^p$ is called an invariant distribution of a Markov chain with transition kernel
density $p(x, y)$ if so-called global balance holds:

$$
\pi(y) = \int_{\mathbb{R}^p} \pi(x) p(x, y)\, dx.
$$

A Markov chain is called reversible if

$$
\pi(x) p(x, y) = \pi(y) p(y, x) \tag{3.31}
$$

for $x, y \in \mathbb{R}^p$, $x \neq y$. It can be shown (Exercise 3.5) that if (3.31) holds, then $\pi(\cdot)$ is
the invariant distribution, which is useful since (3.31) can be easy to check.
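The relationship between reversibility and global balance can be checked numerically on a finite state space, where the integral becomes a sum. The four-state target and the uniform proposal below are hypothetical choices made for illustration; the discrete chain is only an analogue of the continuous-state setting of the text.

```python
import numpy as np

# Hypothetical 4-state target distribution.
pi = np.array([0.1, 0.2, 0.3, 0.4])
n = len(pi)

# Metropolis kernel with a uniform proposal over the other states.
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            P[x, y] = (1.0 / (n - 1)) * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()            # rejected proposals stay at x

flow = pi[:, None] * P                    # flow[x, y] = pi(x) p(x, y)
reversible = np.allclose(flow, flow.T)    # discrete version of (3.31)
invariant = np.allclose(pi @ P, pi)       # discrete version of global balance
```

Summing the reversibility condition over $x$ gives global balance directly, which is the argument the check mirrors.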

A key idea is that if we have an invariant distribution, then we can evaluate
long-term, or ergodic, averages from realizations of the chain. This is crucial for
making inference in a Bayesian setting since it means we can estimate quantities of
interest such as posterior means and medians. In Markov chain theory, the conditions
on the transition kernel under which invariant distributions exist are an important
topic. Within an MCMC context, this is not important since the posterior distribution
is the invariant distribution and we are concerned with constructing Markov chains
(transition kernels) with $\pi(\cdot)$ as invariant distribution. Only very mild conditions,
usually aperiodicity and irreducibility, are required to ensure that $\pi(\cdot)$ is the
invariant distribution. A chain is periodic if there are places in the parameter
space that can only be reached at certain regularly spaced times; otherwise, it is
aperiodic. A Markov chain with invariant distribution $\pi(\cdot)$ is irreducible if for any
starting point, there is positive probability of entering any set to which $\pi(\cdot)$ assigns
positive probability.

Suppose that $x^{(1)}, \dots, x^{(m)}$ represents the sample path of the Markov chain.
Then expectations with respect to the invariant distribution

$$
\mu = E[g(x)] = \int g(x)\pi(x)\, dx
$$

may be approximated by $\hat\mu_m = \frac{1}{m}\sum_{t=1}^{m} g(x^{(t)})$. Monte Carlo standard errors are
more difficult to obtain than in the independent sampling case. The Markov chain
law of large numbers (the ergodic theorem) tells us that

$$
\hat\mu_m \to_{a.s.} \mu
$$

as $m \to \infty$, and the Markov chain central limit theorem states that

$$
\sqrt{m}(\hat\mu_m - \mu) \to_d N(0, \tau^2)
$$

where

$$
\tau^2 = \text{var}[g(x^{(t)})] + 2\sum_{k=1}^{\infty} \text{cov}[g(x^{(t)}), g(x^{(t+k)})] \tag{3.32}
$$

and the summation term accounts for the dependence in the chain. Chan and Geyer
(1994) provide assumptions for validity of this form. Section 3.8.6 describes how
$\tau^2$ may be estimated in practice. We now describe algorithms that define Markov
chains that are well suited to Bayesian computation.
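As a worked illustration of (3.32), added here and not taken from the text, suppose that $g(x^{(t)})$ has marginal variance $\sigma_g^2$ and lag-$k$ autocorrelation $\rho^k$ with $|\rho| < 1$, as for a stationary first-order autoregressive process (an assumption made purely for illustration). Then

$$
\tau^2 = \sigma_g^2\left(1 + 2\sum_{k=1}^{\infty} \rho^k\right) = \sigma_g^2\, \frac{1 + \rho}{1 - \rho}.
$$

With $\rho = 0.9$ this gives $\tau^2 = 19\sigma_g^2$, so that roughly 19 times as many dependent draws as independent draws are needed for the same Monte Carlo standard error, while $\rho = 0$ recovers the independent sampling variance $\sigma_g^2$.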


3.8.2  The  Metropolis-Hastings  Algorithm 

The Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) provides
a very flexible method for defining a Markov chain. At iteration $t$ of the
Markov chain's evolution, suppose the current point is $x^{(t)}$. The following steps
provide the new point $x^{(t+1)}$:

1. Sample a point $y$ from a proposal distribution $q(\cdot \mid x^{(t)})$.

2. Calculate the acceptance probability:

$$
\alpha(x^{(t)}, y) = \min\left\{ \frac{\pi(y)}{\pi(x^{(t)})} \times \frac{q(x^{(t)} \mid y)}{q(y \mid x^{(t)})},\ 1 \right\}. \tag{3.33}
$$

3. Set

$$
x^{(t+1)} = \begin{cases} y & \text{with probability } \alpha(x^{(t)}, y), \\ x^{(t)} & \text{otherwise.} \end{cases}
$$

In a Bayesian context, the term $\pi(y)/\pi(x^{(t)})$ in (3.33) is the ratio of the posterior
densities at the proposed to the current point; since we are taking the ratio, the
normalizing constant in the posterior cancels, which is crucial since this is typically
unavailable. The second term in (3.33) is the ratio of the density of moving from
$y \to x^{(t)}$ to the density of moving from $x^{(t)} \to y$, and it is this term that
guarantees global balance and hence that the Markov chain has the correct invariant
distribution; see Exercise 3.6. In an independence chain, the proposal distribution
does not depend on the current point, that is, $q(y \mid x^{(t)})$ is independent of $x^{(t)}$. We
now consider a special case of the algorithm that is particularly easy to implement
and widely used.
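The three steps can be sketched generically as below. The function and argument names are introduced here for illustration, and the usage example (a standard normal target with an independence proposal $N(0, 2^2)$) is hypothetical; it is chosen because the proposal ratio in (3.33) does not cancel in that case.

```python
import numpy as np

def metropolis_hastings(log_target, sample_q, log_q, x0, n_iter, rng):
    """Generic M-H sampler; log_q(a, b) is the log density of proposing a from b."""
    x = x0
    chain = np.empty(n_iter)
    n_accept = 0
    for t in range(n_iter):
        y = sample_q(x, rng)                                # step 1: propose
        log_ratio = (log_target(y) - log_target(x)          # pi(y) / pi(x)
                     + log_q(x, y) - log_q(y, x))           # q(x | y) / q(y | x)
        if np.log(rng.uniform()) < min(0.0, log_ratio):     # steps 2 and 3
            x = y
            n_accept += 1
        chain[t] = x
    return chain, n_accept / n_iter

# Hypothetical usage: unnormalized N(0, 1) target, independence proposal N(0, 2^2).
log_target = lambda x: -0.5 * x**2
sample_q = lambda x, rng: rng.normal(0.0, 2.0)      # ignores the current point
log_q = lambda a, b: -0.5 * (a / 2.0) ** 2          # log q(a | b), constants dropped
rng = np.random.default_rng(4)
chain, acc_rate = metropolis_hastings(log_target, sample_q, log_q, 0.0, 50_000, rng)
```

Working on the log scale means that only unnormalized densities are needed, in line with the cancellation of the normalizing constant noted above.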


3.8.3  The  Metropolis  Algorithm 

Suppose the proposal distribution is symmetric in the sense that

$$
q(y \mid x^{(t)}) = q(x^{(t)} \mid y).
$$

In this case the product of ratios in (3.33) simplifies to

$$
\alpha(x^{(t)}, y) = \min\left\{ \frac{\pi(y)}{\pi(x^{(t)})},\ 1 \right\},
$$

so that only the ratio of target posterior densities is required. In the random walk
Metropolis algorithm $q(y \mid x^{(t)}) = q(|y - x^{(t)}|)$, with common choices for $q(\cdot)$
being normal or uniform distributions. In a range of circumstances, an acceptance
probability of around 30% is optimal (Roberts et al. 1997), which may be obtained
by tuning the proposal density, the variance in a normal proposal, for example. The
balancing act is between having high acceptance rates with small movement and
having low acceptance rates with large movement.
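The trade-off between acceptance rate and movement can be seen in a small experiment. The sketch below runs a random walk Metropolis chain on a hypothetical standard normal target for three proposal standard deviations; the target and the particular values 0.1, 2.4, and 25 are assumptions chosen for illustration.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, step_sd, n_iter, rng):
    """Random walk Metropolis with a normal proposal of standard deviation step_sd."""
    x, lp = x0, log_target(x0)
    chain = np.empty(n_iter)
    accepted = 0
    for t in range(n_iter):
        y = x + step_sd * rng.normal()           # symmetric proposal
        lp_y = log_target(y)
        if np.log(rng.uniform()) < lp_y - lp:    # accept w.p. min{pi(y) / pi(x), 1}
            x, lp = y, lp_y
            accepted += 1
        chain[t] = x
    return chain, accepted / n_iter

log_target = lambda x: -0.5 * x**2               # hypothetical N(0, 1) target
rng = np.random.default_rng(5)
rates = {sd: random_walk_metropolis(log_target, 0.0, sd, 20_000, rng)[1]
         for sd in (0.1, 2.4, 25.0)}
```

Very small steps are almost always accepted but move the chain slowly, while very large steps are rarely accepted; the intermediate value gives a moderate acceptance rate.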


3.8.4  The  Gibbs  Sampler 

We describe a particularly popular algorithm for simulating from a Markov chain,
the Gibbs sampler. We describe two flavors: the sequential Gibbs sampler and the
random scan Gibbs sampler. In the following, let $x_{-i}$ represent the vector $x$ with
the $i$th variable removed, that is, $x_{-i} = [x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_p]$.

The sequential scan Gibbs sampling algorithm starts with some initial value $x^{(0)}$
and then, with current point $x^{(t)} = [x_1^{(t)}, \dots, x_p^{(t)}]$, undertakes the following $p$ steps
to produce a new point $x^{(t+1)} = [x_1^{(t+1)}, \dots, x_p^{(t+1)}]$:

Sample $x_1^{(t+1)} \sim \pi_1(x_1 \mid x_{-1}^{(t)})$

Sample $x_2^{(t+1)} \sim \pi_2(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \dots, x_p^{(t)})$

...

Sample $x_p^{(t+1)} \sim \pi_p(x_p \mid x_{-p}^{(t+1)})$

The beauty of the Gibbs sampler is that the often hard problem of sampling for
the full $p$-dimensional variable $x$ has been broken into sampling for each of the $p$
variables in turn via the conditional distributions.
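A sequential scan Gibbs sampler is easiest to see when the full conditionals are known exactly. The sketch below uses a standard bivariate normal target with correlation $\rho$, for which $x_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$ and similarly for $x_2$; the target and the value $\rho = 0.8$ are assumptions chosen for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter, rng, x0=(0.0, 0.0)):
    """Sequential scan Gibbs for a standard bivariate normal with correlation rho."""
    x1, x2 = x0
    cond_sd = np.sqrt(1.0 - rho**2)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, cond_sd)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_sd)   # x2 | x1 uses the freshly updated x1
        draws[t] = x1, x2
    return draws

rng = np.random.default_rng(6)
draws = gibbs_bivariate_normal(rho=0.8, n_iter=50_000, rng=rng)
sample_corr = np.corrcoef(draws.T)[0, 1]
```

Each update conditions on the most recent value of the other component, which is what distinguishes the sequential scan from drawing both components given the previous point.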

We now illustrate that the Gibbs sampling algorithm produces a transition kernel
density that gives the required stationary distribution. We do this by showing that
each component is a Metropolis-Hastings step. Consider a single component move
in the Gibbs sampler from the current point $x^{(t)}$ to the new point $x^{(t+1)}$, with $x^{(t+1)}$
obtained by replacing the $i$th component in $x^{(t)}$ with a draw from the full conditional
$\pi_i(x_i \mid x_{-i}^{(t)})$. We view this move in light of the Metropolis-Hastings algorithm
in which the proposal density is the full conditional itself. Then, since
$\pi(x) = \pi_i(x_i \mid x_{-i})\pi(x_{-i})$ and $x_{-i}^{(t+1)} = x_{-i}^{(t)}$, the Metropolis-Hastings
acceptance ratio becomes

$$
\alpha(x^{(t)}, x^{(t+1)}) = \min\left\{ \frac{\pi(x^{(t+1)})}{\pi(x^{(t)})} \times \frac{\pi_i(x_i^{(t)} \mid x_{-i}^{(t)})}{\pi_i(x_i^{(t+1)} \mid x_{-i}^{(t)})},\ 1 \right\} = \min\left\{ \frac{\pi(x_{-i}^{(t)})}{\pi(x_{-i}^{(t)})},\ 1 \right\} = 1.
$$

Consequently, when we use full conditionals as our proposals in the Metropolis-Hastings
step, we always accept. This means that drawing from a full conditional
distribution produces a Markov chain with stationary distribution $\pi(x)$. Clearly, we
cannot keep updating only the $i$th component, because we will not be able to explore
the whole state space this way, that is, we do not have an irreducible Markov chain.
Therefore, we can update each component in turn. This is the easiest approach to
implement and the most common, but it is not the only way to execute Gibbs
sampling. We can also randomly select a component to update. This is called
random scan Gibbs sampling:


Sample a component $i$ by drawing a random variable with probability mass
function $[\alpha_1, \dots, \alpha_p]$ where $\alpha_i > 0$ and $\sum_{i=1}^{p} \alpha_i = 1$.

Sample $x_i^{(t+1)} \sim \pi_i(x_i \mid x_{-i}^{(t)})$.

Roberts and Sahu (1997) examine the convergence rate of the sequential and random
scan Gibbs sampling schemes and show that the sequential scan version has a better
rate of convergence in the Gaussian models they examine.

In  many  cases,  conjugacy  (Sect.  3.7.1)  can  be  exploited  to  derive  the  conditional 
distributions.  Many  examples  of  this  are  given  in  Chaps.  5  and  8.  It  is  also  common 
for  sampling  from  a  full  conditional  distribution  to  not  require  knowledge  of  the 
normalizing  constant  of  the  target  distribution.  For  example,  we  saw  in  Sect.  3.7.7 
that  rejection  sampling  does  not  require  the  normalizing  constant. 


3.8.5  Combining  Markov  Kernels:  Hybrid  Schemes 

Suppose we can construct $m$ transition kernels, each with invariant distribution
$\pi(\cdot)$. There are two simple ways to combine these transition kernels. First, we can
construct a Markov chain, where at each step we sequentially generate new states
from all kernels in a predetermined order. As long as the new Markov chain is
irreducible, then it will have the required invariant distribution, and we can, for
example, use the ergodic theorem on the samples from the new Markov chain.
Hence, we can combine Gibbs and Metropolis-Hastings steps. One popular form
is Metropolis within Gibbs in which all components with recognizable conditionals
are sampled with Gibbs steps with Metropolis-Hastings for the remainder. In the
second method of combining Markov kernels, we first create a probability vector
$[\alpha_1, \dots, \alpha_m]$, then randomly select kernel $i$ with probability $\alpha_i$, and then use this
kernel to move the Markov chain.
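The first way of combining kernels, cycling through them in a fixed order, can be sketched with a Metropolis within Gibbs scheme. The standard bivariate normal target with correlation 0.8, the kernel names, and the step size are assumptions chosen for illustration: the first component is drawn exactly from its full conditional and the second is updated with a random walk Metropolis step.

```python
import numpy as np

rho = 0.8
cond_sd = np.sqrt(1.0 - rho**2)
# Unnormalized log density of the hypothetical bivariate normal target.
log_joint = lambda x1, x2: -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2))

def gibbs_kernel(x, rng):
    """Exact draw of x1 from its full conditional, x2 unchanged."""
    return np.array([rng.normal(rho * x[1], cond_sd), x[1]])

def metropolis_kernel(x, rng, step_sd=1.0):
    """Random walk Metropolis update of x2 with x1 held fixed."""
    prop = x[1] + step_sd * rng.normal()
    if np.log(rng.uniform()) < log_joint(x[0], prop) - log_joint(x[0], x[1]):
        return np.array([x[0], prop])
    return x

rng = np.random.default_rng(7)
x = np.zeros(2)
draws = np.empty((40_000, 2))
for t in range(len(draws)):
    x = metropolis_kernel(gibbs_kernel(x, rng), rng)   # fixed order: Gibbs, then Metropolis
    draws[t] = x
```

Each kernel leaves the target invariant on its own, and together they can reach every part of the space, so the cycle has the required invariant distribution.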

In  general,  one  can  be  creative  in  the  construction  of  a  Markov  chain,  but  care 
must  be  taken  to  ensure  the  proposed  chain  is  “legal,”  in  the  sense  of  having  the 
required stationary distribution. As an example, a chain with a Metropolis step that
keeps proposing points until the $k$th point, with $k > 1$, is accepted does not have
the correct invariant distribution.

A final warning is that care is required to ensure that the posterior of interest is
proper since there is no built-in check when an MCMC scheme is implemented.
For  example,  one  may  be  able  to  construct  a  set  of  proper  conditional  distributions 
for  Gibbs  sampling,  even  when  the  joint  posterior  distribution  is  not  proper;  see,  for 
example,  Hobert  and  Casella  (1996). 


3.8.6  Implementation  Details 

Although  theoretically  not  required,  many  users  remove  an  initial  number  of 
iterations,  the  rationale  being  that  inferential  summaries  should  not  be  influenced 
by  initial  points  that  might  be  far  from  the  main  mass  of  the  posterior  distribution. 
Inference is then based on samples collected subsequent to this “burn-in” period.

In order to obtain valid Monte Carlo standard errors for empirical averages,
some estimate for $\tau^2$ in (3.32) is required. Time series methods exist to estimate
$\tau^2$, but we describe a simple approach based on batch means (Glynn and Iglehart
1990). The basic idea is to split the output of length $m$ into $K$ batches each of
length $B$, with $B$ chosen to be large enough so that the batch means have low serial
correlation; $B$ should not be too large, however, because we want $K$ to be large
enough to provide a reliable estimate of $\tau^2$. The mean of the function of interest is
then estimated within each of the batches:

$$
\hat\mu_k = \frac{1}{B} \sum_{t=(k-1)B+1}^{kB} g(x^{(t)})
$$

for $k = 1, \dots, K$. The combined estimate of the mean is the average of the batch
means

$$
\hat\mu = \frac{1}{K} \sum_{k=1}^{K} \hat\mu_k.
$$

Then $\sqrt{B}(\hat\mu_k - \mu)$, $k = 1, \dots, K$, are approximately independently distributed as
$N(0, \tau^2)$, and so $\tau^2$ can be estimated by

$$
\hat\tau^2 = \frac{B}{K - 1} \sum_{k=1}^{K} (\hat\mu_k - \hat\mu)^2
$$

and

$$
\widehat{\text{var}}(\hat\mu) = \frac{\hat\tau^2}{KB}.
$$

Normal or Student's t confidence intervals can be calculated based on the square
root of this quantity. The construction of these intervals has the advantage of
being simple, but the output should be viewed with caution as the above derivation
contains a number of approximations.
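The batch means calculation can be sketched in a few lines. The function name and the test chain are assumptions introduced here: the chain is a first-order autoregressive process with coefficient 0.5 and unit innovation variance, for which the variance in the central limit theorem is known to be $1/(1 - 0.5)^2 = 4$.

```python
import numpy as np

def batch_means(g_values, n_batches):
    """Return the overall mean, the batch-means estimate of tau^2, and the standard error."""
    g_values = np.asarray(g_values, dtype=float)
    B = len(g_values) // n_batches                    # batch length
    batches = g_values[: n_batches * B].reshape(n_batches, B)
    mu_k = batches.mean(axis=1)                       # batch means
    mu_hat = mu_k.mean()                              # combined estimate
    tau2_hat = B * np.sum((mu_k - mu_hat) ** 2) / (n_batches - 1)
    return mu_hat, tau2_hat, np.sqrt(tau2_hat / (n_batches * B))

# Hypothetical dependent output: AR(1) with coefficient 0.5 and unit innovations.
rng = np.random.default_rng(8)
m, phi = 50_000, 0.5
x = np.empty(m)
x[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - phi**2))     # start in the stationary distribution
for t in range(1, m):
    x[t] = phi * x[t - 1] + rng.normal()
mu_hat, tau2_hat, se = batch_means(x, n_batches=50)
```

With K = 50 batches the estimate of $\tau^2$ is itself quite variable, which is one of the approximations behind the caution expressed above.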

MCMC approaches provide no obvious estimator of the normalizing constant
$p(y)$, but a number of indirect methods have been proposed (Meng and Wong 1996;
DiCiccio et al. 1997).

Aside  from  directly  calculating  integrals,  we  may  also  form  graphical  summaries 
of  parameters  of  interest,  essentially  using  the  dependent  samples  in  the  same  way 
that we would independent samples. For example, a histogram of $x_i^{(t)}$, $t = 1, \dots, m$, provides an
estimate of the posterior marginal distribution, $\pi_i(x_i)$, $i = 1, \dots, p$.

In  practice,  there  are  a  number  of  important  issues  that  require  thought  when 
implementing  MCMC.  A  crucial  question  is  how  large  m  should  be  in  order  to 
obtain  a  reliable  Monte  Carlo  estimate.  The  Markov  chain  will  display  better  mixing 
properties  if  the  parameters  are  approximately  independent  in  the  posterior.  In  an 
extreme case, if we have independence, then

$$
\pi(x_1, \dots, x_p) = \prod_{i=1}^{p} \pi_i(x_i)
$$

and Gibbs sampling via the conditional distributions $\pi_i(x_i)$, $i = 1, \dots, p$, equates to
direct sampling from the posterior.

Dependence in the Markov chain may be greatly reduced by sampling simultaneously
for variables that are highly dependent, a strategy known as blocking.
Reparameterization  may  also  be  helpful  in  this  regard.  As  the  blocks  become  larger, 
the  acceptance  rate  (if  a  Metropolis-Hastings  algorithm  is  used)  may  be  reduced 
to  an  unacceptably  low  level  in  which  case  there  is  a  trade-off  with  respect  to  the 
size  of  blocks  to  use.  Some  chains  may  be  very  slow  mixing,  and  an  examination 
of  autocorrelation  aids  in  deciding  on  the  number  of  iterations  required.  If  storage 
of  samples  is  an  issue,  then  one  may  decide  to  “thin”  the  chain  by  only  collecting 
samples  at  equally  spaced  intervals. 

A  number  of  methods  have  been  proposed  for  “diagnosing  convergence.”  Trace 
plots  provide  a  useful  method  for  detecting  problems  with  MCMC  convergence  and 
mixing.  Ideally,  trace  plots  of  unnormalized  log  posterior  and  model  parameters 
should  look  like  stationary  time  series.  Slowly  mixing  Markov  chains  produce  trace 
plots  with  high  autocorrelation,  which  can  be  further  visualized  by  plotting  the 
autocorrelation  at  different  lags.  Slow  mixing  does  not  imply  lack  of  convergence, 
however,  but  that  more  samples  will  be  required  for  accurate  inference  (as  can  be 
seen from (3.32)). When examining trace plots and autocorrelations, it is clearer to
work with parameters transformed to $\mathbb{R}$. Running multiple chains from different
starting  points  is  also  very  useful  since  one  may  compare  inference  between 
the  different  chains.  Gelman  and  Rubin  (1992)  provide  one  popular  convergence 
diagnostic  based  on  multiple  chains.  As  with  the  use  of  diagnostics  in  regression 
modeling,  convergence  diagnostics  may  detect  evidence  of  poor  behavior,  but  there 
is  no  guarantee  of  good  behavior  of  the  chain,  even  if  all  convergence  diagnostics 
appear  reasonable. 
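Two of the summaries mentioned above, autocorrelations at different lags and a comparison of multiple chains, can be sketched as follows. The potential scale reduction factor below is the basic between-chain and within-chain variance comparison associated with Gelman and Rubin (1992), without the refinements of the original paper; the function names and the simulated "chains" are assumptions made for illustration.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic potential scale reduction factor for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_plus / W)

def autocorrelation(x, max_lag):
    """Sample autocorrelations at lags 1, ..., max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(9)
mixed = rng.normal(size=(4, 2_000))                       # four chains, same distribution
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])    # two chains in another region
rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
acf_white = autocorrelation(mixed[0], max_lag=5)
```

Values of the factor close to 1 are consistent with the chains sampling the same distribution, but, as noted above, they do not guarantee good behavior.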


Example:  Poisson  Likelihood,  Lognormal  Prior 

Recall the Poisson lognormal example in which y = 4 and E = 0.25 with a single
parameter, the log relative risk $\theta$. Gibbs sampling corresponds to direct sampling
from the univariate posterior for $\theta$, which we have already illustrated using the
rejection algorithm.

We implement a random walk Metropolis algorithm using a normal kernel and
the asymptotic variance of the MLE for $\theta$ multiplied by 3 as the variance of the
proposal, to achieve a reasonable acceptance probability of 0.32. This multiplier
was found by trial and error, based on preliminary runs of the Markov chain. It is
important to restart the chain when the proposal is changed based on past
realizations to ensure the chain is still Markovian. Table 3.2 gives estimates of the
posterior mean and variance based on a run length of 51,000, with the first 1,000
discarded as a burn-in. The confidence interval for the estimates of the posterior
mean and posterior variance is based on the batch means method, with K = 50
batches of size B = 1,000.
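A sketch of this analysis is given below, with a batch means interval for the posterior mean. It makes assumptions that should be checked against the original implementation: the asymptotic variance of the MLE is taken to be $1/y$, the chain is started at the MLE, and the random seed is arbitrary, so the acceptance rate and interval need not reproduce the values quoted in the text.

```python
import math
import numpy as np

y, E = 4, 0.25
mu, sigma = 0.0, 1.38

def log_post(theta):
    """Unnormalized log posterior for the log relative risk."""
    return y * theta - E * math.exp(theta) - (theta - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(10)
n_iter, burn_in = 51_000, 1_000
prop_sd = math.sqrt(3 * (1 / y))          # 3 x asymptotic variance, assumed to be 1/y
theta = math.log(y / E)                   # start at the MLE (an assumption)
chain = np.empty(n_iter)
n_accept = 0
for t in range(n_iter):
    prop = theta + prop_sd * rng.normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
        n_accept += 1
    chain[t] = theta
kept = chain[burn_in:]                    # discard the burn-in

# Batch means with K = 50 batches of size B = 1,000, as in the text.
K, B = 50, 1_000
batch_avgs = kept.reshape(K, B).mean(axis=1)
post_mean = batch_avgs.mean()
se_mean = batch_avgs.std(ddof=1) / math.sqrt(K)
interval = (post_mean - 1.96 * se_mean, post_mean + 1.96 * se_mean)
post_var = kept.var(ddof=1)
acc_rate = n_accept / n_iter
```

The same batching could be applied to the squared deviations of the samples to obtain an interval for the posterior variance.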


Example:  Lung  Cancer  and  Radon 

We return to the lung cancer and radon example, first introduced in Sect. 1.3.3, to
demonstrate the use of the Metropolis random walk algorithm in a situation with
more than one parameter. For direct comparison with methods applied in Chap. 2,
we assume an improper flat prior on $\beta = [\beta_0, \beta_1]$ so that the posterior $p(\beta \mid y)$ is
proportional to the likelihood.

We begin by implementing a Metropolis random walk algorithm based on a
pair of univariate normal distributions. In this example, the Gibbs sampler is less
appealing since the required conditional distributions do not assume known forms.
The first step is to initialize $\beta_j^{(0)} = \hat\beta_j$, where $\hat\beta_j$, $j = 0, 1$, are the MLEs. We then
iterate, at iteration $t$, between:

1. Generate $\beta_0^\star \sim N(\beta_0^{(t)}, c_0 V_0)$, where $V_0$ is the asymptotic variance of $\hat\beta_0$.
Calculate the acceptance probability:

$$
\alpha_0(\beta_0^\star, \beta_0^{(t)}) = \min\left\{ \frac{p(\beta_0^\star, \beta_1^{(t)} \mid y)}{p(\beta_0^{(t)}, \beta_1^{(t)} \mid y)},\ 1 \right\}
$$

and set

$$
\beta_0^{(t+1)} = \begin{cases} \beta_0^\star & \text{with probability } \alpha_0(\beta_0^\star, \beta_0^{(t)}), \\ \beta_0^{(t)} & \text{otherwise.} \end{cases}
$$

2. Generate $\beta_1^\star \sim N(\beta_1^{(t)}, c_1 V_1)$, where $V_1$ is the asymptotic variance of $\hat\beta_1$.
Calculate the acceptance probability:

$$
\alpha_1(\beta_1^\star, \beta_1^{(t)}) = \min\left\{ \frac{p(\beta_0^{(t+1)}, \beta_1^\star \mid y)}{p(\beta_0^{(t+1)}, \beta_1^{(t)} \mid y)},\ 1 \right\}
$$

and set

$$
\beta_1^{(t+1)} = \begin{cases} \beta_1^\star & \text{with probability } \alpha_1(\beta_1^\star, \beta_1^{(t)}), \\ \beta_1^{(t)} & \text{otherwise.} \end{cases}
$$


The constants $c_0$ and $c_1$ are chosen to provide a trade-off between gaining a
high proportion of acceptances and moving around the support of the parameter
space; this is illustrated in Fig. 3.4 where the realized parameters from the first
1,000 iterations of two Markov chains are plotted. In panels (a) and (d), we chose
$c_0 = c_1 = c = 0.1$ and in panels (b) and (e) $c_0 = c_1 = c = 2$. For $c = 0.1$
the acceptance rate is 0.90, but movement around the space is slow, as indicated by
the meandering nature of the chain, while for $c = 2$ the moves tend to be larger,
but the chain sticks at certain values, as seen by the horizontal runs of points (the
acceptance rate is 0.14).

Fig. 3.4 Sample paths from Metropolis-Hastings algorithms for $\beta_0$ (top row) and $\beta_1$ (bottom row)
for the lung cancer and radon data, plotted against iteration number (0 to 1,000). In the left
column the proposal random walk has small variance; in the center column large variance; and in
the right column, we use a bivariate proposal
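The two-step scheme can be sketched on synthetic data. The lung cancer and radon data are not reproduced in this section, so the covariate, expected counts, and true coefficients below are hypothetical stand-ins, and the acceptance rates will differ from the 0.90 and 0.14 reported in the text. The MLE and its asymptotic variance are obtained by Newton-Raphson, and with a flat prior the posterior is proportional to the likelihood.

```python
import numpy as np

# Synthetic stand-in data (hypothetical, not the lung cancer and radon data).
rng = np.random.default_rng(11)
n = 85
x = rng.uniform(0.0, 10.0, size=n)                  # hypothetical covariate
E = rng.uniform(50.0, 300.0, size=n)                # hypothetical expected counts
y = rng.poisson(E * np.exp(0.2 - 0.04 * x))
X = np.column_stack([np.ones(n), x])

def log_lik(beta):
    """Poisson log-likelihood up to a constant; also the log posterior under a flat prior."""
    eta = X @ beta
    return np.sum(y * eta - E * np.exp(eta))

# Newton-Raphson for the MLE and its asymptotic variance.
beta_hat = np.array([np.log(y.sum() / E.sum()), 0.0])
for _ in range(25):
    mean = E * np.exp(X @ beta_hat)
    info = X.T @ (X * mean[:, None])
    beta_hat = beta_hat + np.linalg.solve(info, X.T @ (y - mean))
V = np.linalg.inv(X.T @ (X * (E * np.exp(X @ beta_hat))[:, None]))

def componentwise_metropolis(c, n_iter, rng):
    """Update beta_0 then beta_1, each with proposal variance c * V_j."""
    beta = beta_hat.copy()
    chain = np.empty((n_iter, 2))
    accepted = 0
    for t in range(n_iter):
        for j in range(2):
            prop = beta.copy()
            prop[j] += np.sqrt(c * V[j, j]) * rng.normal()
            if np.log(rng.uniform()) < log_lik(prop) - log_lik(beta):
                beta = prop
                accepted += 1
        chain[t] = beta
    return chain, accepted / (2 * n_iter)

chain_small, rate_small = componentwise_metropolis(0.1, 5_000, rng)
chain_large, rate_large = componentwise_metropolis(2.0, 5_000, rng)
```

Plotting the columns of `chain_small` and `chain_large` against iteration number gives sample paths of the kind described for Fig. 3.4.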

Figure 3.6a shows a scatterplot representation of the joint distribution $p(\beta_0, \beta_1 \mid y)$
and clearly shows the strong negative dependence; the asymptotic correlation
between the MLEs $\hat\beta_0$ and $\hat\beta_1$ is −0.90, and the posterior correlation between
$\beta_0$ and $\beta_1$ is −0.90 also (the correspondence between these correlations is not
surprising since the sample size is large and the prior is flat). The strong negative
dependence is evident in each of the first two columns of Fig. 3.4. Figure 3.5 shows
the autocorrelations between sampled parameters at lags of between 1 and 40. The
top row is for $\beta_0$, and the bottom is for $\beta_1$. In panels (a) and (d), the autocorrelations
are high because of the small movements of the chain.

The dependence in the chain may be reduced via reparameterization or by
generation from a bivariate proposal. We implement the latter with variance-covariance
matrix equal to $c \times \text{var}(\hat\beta)$. The acceptance rate for the bivariate proposal
with $c = 2$ is 0.29, which is reasonable. We then iterate the following:

1. Generate $\beta^\star \sim N_2(\beta^{(t)}, cV)$, where $V$ is the asymptotic variance of the
MLE $\hat\beta$.

2. Calculate the acceptance probability

$$
\alpha(\beta^\star, \beta^{(t)}) = \min\left\{ \frac{p(\beta^\star \mid y)}{p(\beta^{(t)} \mid y)},\ 1 \right\}
$$

and set

$$
\beta^{(t+1)} = \begin{cases} \beta^\star & \text{with probability } \alpha(\beta^\star, \beta^{(t)}), \\ \beta^{(t)} & \text{otherwise.} \end{cases}
$$

Fig. 3.5 Autocorrelation functions for $\beta_0$ (top row) and $\beta_1$ (bottom row) for the lung cancer and
radon data. First column: univariate random walk, c = 0.1; second column: univariate random
walk, c = 2; third column: bivariate random walk, c = 2
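The benefit of a proposal that follows the posterior correlation can be sketched on a stand-in target. The bivariate normal "posterior" with unit variances and correlation −0.9 is an assumption replacing the actual posterior, and the comparison is with a joint proposal that has the same marginal variances but ignores the correlation, which is a simplification of the component-wise scheme rather than a reproduction of it.

```python
import numpy as np

# Stand-in posterior: bivariate normal with correlation -0.9 (an assumption).
V = np.array([[1.0, -0.9], [-0.9, 1.0]])
V_inv = np.linalg.inv(V)
log_post = lambda b: -0.5 * b @ V_inv @ b

def run_chain(propose, n_iter, rng):
    """Random walk Metropolis with a user-supplied symmetric proposal."""
    beta, chain, accepted = np.zeros(2), np.empty((n_iter, 2)), 0
    for t in range(n_iter):
        prop = propose(beta, rng)
        if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
            beta, accepted = prop, accepted + 1
        chain[t] = beta
    return chain, accepted / n_iter

c = 2.0
L = np.linalg.cholesky(c * V)                        # for the proposal N_2(beta, cV)
bivariate = lambda b, rng: b + L @ rng.normal(size=2)
uncorrelated = lambda b, rng: b + np.sqrt(c * np.diag(V)) * rng.normal(size=2)

def lag1(x):
    """Lag-1 sample autocorrelation."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

rng = np.random.default_rng(12)
chain_biv, rate_biv = run_chain(bivariate, 20_000, rng)
chain_unc, rate_unc = run_chain(uncorrelated, 20_000, rng)
acf_biv, acf_unc = lag1(chain_biv[:, 0]), lag1(chain_unc[:, 0])
```

Proposals aligned with the ridge of the posterior are accepted more often and move further, which lowers the autocorrelation of the sampled values.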


Note that the choice of $c$ and the dependence in the chain do not jeopardize the
invariant distribution; rather, they affect the length of chain until practical
convergence is reached and the number of points required for summarization. More
points are required when there is high positive dependence in successive iterates,
which is clear from (3.32). The final column of Fig. 3.4 shows the sample path from
the bivariate proposal, with good movement and little dependence between the
parameters. Panels (c) and (f) show that the autocorrelation is also greatly reduced.

Fig. 3.6 Posterior summaries for the lung cancer and radon data: (a) $p(\beta_0, \beta_1 \mid y)$,
(b) $p(\log\theta_0, \log\theta_1 \mid y)$, (c) $p(\beta_0 \mid y)$, (d) $p(\theta_0 \mid y)$, (e) $p(\beta_1 \mid y)$, (f) $p(\theta_1 \mid y)$

Figure 3.6 shows inference for the reparameterized model

$$
Y_i \mid \theta \sim_{ind} \text{Poisson}(E_i \theta_0 \theta_1^{x_i - \bar{x}})
$$

where $\theta_0 = \exp(\beta_0 + \beta_1 \bar{x}) > 0$ and $\theta_1 = \exp(\beta_1) > 0$, along with summaries
for the $\beta_0, \beta_1$ parameterization. Figure 3.6(b) shows the bivariate posterior for
$\log\theta_0, \log\theta_1$ and demonstrates that the parameters are virtually independent (the
correlation is −0.03). By comparison there is strong negative dependence between
$\beta_0$ and $\beta_1$ (panel (a)). Panels (d) and (f) show histogram representations of the
posteriors of interest $p(\theta_0 \mid y)$ and $p(\theta_1 \mid y)$.

The posterior median (95% credible interval) for $\exp(\beta_1)$ is 0.965 [0.954, 0.975],
which is almost identical to the asymptotic inference under a Poisson model
(Table 2.4), which is again not surprising given the large sample size.

Fig. 3.7 (a) Mean-variance relationships, in the negative binomial model, for values of b between
50 and 200, in increments of 10 units. The dashed line is the line of equality corresponding to the
Poisson model, which is recovered as $b \to \infty$. (b) Lognormal prior for b


The  Poisson  model  should  be  used  with  caution  since  the  variance  is  determined 
by  the  mean,  with  no  additional  parameter  to  soak  up  excess-Poisson  variability, 
which  is  often  present  in  practice.  To  overcome  this  shortcoming  we  provide  a 
Bayesian  analysis  with  a  negative  binomial  likelihood,  parameterized  so  that 

$$E[Y_i \mid \beta, b] = \mu_i(\beta), \qquad \mathrm{var}(Y_i \mid \beta, b) = \mu_i(\beta)\left[1 + \mu_i(\beta)/b\right]. \qquad (3.34)$$

We will continue with an improper flat prior for $\beta$, but a prior for $b$ requires more thought. To determine a prior, we plot the mean-variance relationship in Fig. 3.7a, for different values of $b$. In this example the regression model does not include information on confounders such as smoking. The absence of these variables will certainly lead to bias in the estimate of $\exp(\beta_1)$ due to confounding, but with respect to $b$, we might expect considerable excess-Poisson variability due to missing variables. The sample average of the observed counts is 158, and we specify a lognormal prior for $b$ by giving two quantiles of the overdispersion, $\mu(1 + \mu/b)$, at $\mu = 158$, and then solve for $b$. Specifically, we suppose that there is a 50% chance that the overdispersion is less than $1.5 \times \mu$ and a 95% chance that it is less than $5 \times \mu$. Formulas (3.15) and (3.16) give a lognormal prior with parameters 3.68 and 1.262 and 5%, 50%, and 95% quantiles of 4.9, 40, and 316, respectively. Figure 3.7(b) gives the resulting lognormal prior density.
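The prior construction above can be sketched numerically. The following is a minimal illustration, not the author's code: it uses the standard two-quantile solution for a lognormal distribution (assumed here to be what formulas (3.15) and (3.16) express), and it assumes that the median of $b$ corresponds to a variance of $5\mu$ and the 95% point of $b$ to a variance of $1.5\mu$, a reading chosen because it reproduces the quoted quantiles of 4.9, 40, and 316. The function name `lognormal_from_quantiles` is hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_from_quantiles(q1, p1, q2, p2):
    """Return (mu, sigma) of a lognormal with Pr(X <= q1) = p1 and Pr(X <= q2) = p2."""
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    sigma = (math.log(q2) - math.log(q1)) / (z2 - z1)
    mu = math.log(q1) - sigma * z1
    return mu, sigma

mean_count = 158.0
# variance = mean * (1 + mean / b); a variance of k * mean gives b = mean / (k - 1)
b_median = mean_count / (5.0 - 1.0)   # assumed to be the 50% point of b
b_upper = mean_count / (1.5 - 1.0)    # assumed to be the 95% point of b

mu, sigma = lognormal_from_quantiles(b_median, 0.5, b_upper, 0.95)
q05 = math.exp(mu + sigma * NormalDist().inv_cdf(0.05))
print(round(mu, 2), round(sigma, 2), round(q05, 1))
```

Under these assumptions the location parameter is close to 3.68 and the implied 5% quantile is close to 4.9, in line with the values quoted in the text.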

A random walk Metropolis algorithm with a normal proposal was constructed for $\beta_0, \beta_1, b$ with the variance-covariance matrix taken as 3 times the asymptotic variance-covariance matrix ($b$ is asymptotically independent of $\beta_0$ and $\beta_1$), based on the expected information. The posterior median and 95% credible interval for $\exp(\beta_1)$ are 0.970 [0.955, 0.987], and for $b$ the summaries are 57.8 [34.9, 105]. The MLE is $\hat{b} = 61.3$, with asymptotic 95% confidence interval (calculated on the $\log b$ scale and then exponentiated) of [35.4, 106]. Therefore, likelihood and Bayesian inference for $b$ are in close agreement for these data. Histograms of samples from the univariate posteriors for $\beta_0$, $\beta_1$, and $b$ are shown in the first row of Fig. 3.8, while bivariate scatterplots are shown in the second row. The posterior marginals for $\beta_0$ and $\beta_1$ are very symmetric, while that for $b$ is slightly skewed.

Fig. 3.8 Univariate and bivariate summaries of the posterior $p(\beta_0, \beta_1, b \mid y)$, arising from the negative binomial model
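A random walk Metropolis scheme of the kind just described can be sketched as follows. This is an illustration under stated assumptions rather than the analysis reported above: the data are simulated, the model is a two-parameter Poisson log-linear regression with a flat prior (not the negative binomial model), and the function and variable names are hypothetical. The proposal covariance is taken as 3 times the inverse expected information, mirroring the scaling used in the text.

```python
import numpy as np

def random_walk_metropolis(log_post, start, prop_cov, n_iter, seed=0):
    """Random walk Metropolis with a N(0, prop_cov) proposal increment."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(prop_cov)
    current = np.asarray(start, dtype=float)
    current_lp = log_post(current)
    draws = np.empty((n_iter, current.size))
    accepted = 0
    for t in range(n_iter):
        proposal = current + chol @ rng.standard_normal(current.size)
        proposal_lp = log_post(proposal)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        draws[t] = current
    return draws, accepted / n_iter

# Simulated toy data (an assumption; not the lung cancer and radon data)
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=200)
y = data_rng.poisson(np.exp(0.5 + 0.3 * x))
X = np.column_stack([np.ones_like(x), x])

def log_post(beta):
    """Poisson log-likelihood plus a flat prior (additive constants dropped)."""
    eta = X @ beta
    return float(np.sum(y * eta - np.exp(eta)))

# Expected information X^T diag(mu) X evaluated at a rough starting value
start = np.array([np.log(y.mean()), 0.0])
expected_info = X.T @ (np.exp(X @ start)[:, None] * X)
prop_cov = 3.0 * np.linalg.inv(expected_info)

draws, acc_rate = random_walk_metropolis(log_post, start, prop_cov, n_iter=5000)
# Posterior median of exp(beta_1) after discarding an initial stretch of the chain
posterior_median_rr = np.median(np.exp(draws[1000:, 1]))
```

The acceptance rate `acc_rate` and trace plots of `draws` would be inspected in practice before any summaries are reported.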


3.8.7  Implementation  Summary 

While  MCMC  has  revolutionized  Bayesian  inference  in  terms  of  the  breadth  of 
applications  and  complexity  of  models  that  can  now  be  considered,  other  methods 
may  still  be  preferable  in  some  situations,  in  particular  when  the  number  of 
parameters  is  small.  Direct  sampling  from  the  posterior  is  particularly  appealing 
since  one  retains  all  of  the  advantages  of  sample-based  inference  (e.g.,  the  ability 
to  simply  examine  generic  functions  of  interest),  without  the  need  to  worry  about 
the  convergence  issues  associated  with  MCMC.  Quadrature  methods  are  also 
appealing  for  low-dimensional  problems,  since  they  are  highly  efficient.  The  latter 
is  particularly  important  if  the  calculation  of  the  likelihood  is  expensive.  Importance 
sampling  Monte  Carlo  methods  are  appealing  in  that  error  assessment  may  be 
carried  out;  analytical  approximations  are,  in  general,  poor  in  this  respect. 

INLA  is  very  attractive  due  to  its  speed  of  computation,  though  a  reliable  measure 
of  accuracy  is  desirable  and  there  are  practical  situations  in  which  the  method 
is  not  accurate.  For  example,  the  method  is  less  accurate  for  binomial  data  with 
small  denominators  (Fong  et  al.  2010).  In  exploratory  situations,  one  may  always 
use  quick  methods  such  as  INLA  for  initial  modeling,  with  more  computationally 
demanding approaches being used once a set of final models has been homed in on.
INLA is also useful for performing simulation studies to examine the properties of
model summaries. In general, comparing results across different methods is a good
idea. When deciding upon a method of implementation, there is often a clear
trade-off between efficiency and the time taken to code prospective methods. MCMC
methods  are  often  easy  to  implement,  but  are  not  always  the  most  efficient  (at  least 
not  for  basic  schemes)  and  are  difficult  to  automate.  For  many  high-dimensional 
problems,  MCMC  may  be  the  only  method  that  is  feasible,  although  INLA  may  be 
available  if  the  model  is  of  the  required  form  (a  small  number  of  “non-Gaussian” 
parameters). 

An  important  paper  in  the  history  of  MCMC  is  that  of  Green  (1995)  in  which 
reversible  jump  MCMC  was  introduced.  This  method  can  be  used  in  situations  in 
which  the  parameter  space  is  of  varying  dimension  across  different  models. 


3.9  Exchangeability 

We  now  provide  a  brief  discussion  of  de  Finetti’s  celebrated  representation  theorem 
which  describes  the  form  of  the  marginal  distribution  of  a  collection  of  random 
variables,  under  certain  assumptions.  As  we  will  see,  this  provides  one  way  in  which 
important  modeling  questions  can  be  framed.  We  first  require  the  introduction  of  a 
very  important  concept  in  Bayesian  inference,  exchangeability. 

Definition. Let $p(y_1, \ldots, y_n)$ be the joint density of $Y_1, \ldots, Y_n$. If

$$p(y_1, \ldots, y_n) = p(y_{\pi(1)}, \ldots, y_{\pi(n)})$$

for all permutations, $\pi$, of $\{1, 2, \ldots, n\}$, then $Y_1, \ldots, Y_n$ are (finitely) exchangeable.

This definition essentially says that the labels identifying the individual
components are uninformative. Obviously if a collection of $n$ random variables is
exchangeable, this implies that the marginal distributions of all single random
variables are the same, as are the marginal distributions for all pairs, all triples, etc. A
collection of random variables is infinitely exchangeable if every finite subcollection
is exchangeable.

As a simple example, consider Bernoulli random variables, $Y_i$, for $i = 1, \ldots, n$ with $n = 3$. Under exchangeability,

$$\begin{aligned}
\Pr(Y_1 = a, Y_2 = b, Y_3 = c) &= \Pr(Y_1 = a, Y_2 = c, Y_3 = b) \\
&= \Pr(Y_1 = b, Y_2 = a, Y_3 = c) \\
&= \Pr(Y_1 = b, Y_2 = c, Y_3 = a) \\
&= \Pr(Y_1 = c, Y_2 = a, Y_3 = b) \\
&= \Pr(Y_1 = c, Y_2 = b, Y_3 = a)
\end{aligned}$$

for all $a, b, c = 0, 1$.
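Result. If $\theta \sim \pi(\theta)$ and $Y_1, \ldots, Y_n$ are conditionally independent and identically distributed given $\theta$, then $Y_1, \ldots, Y_n$ are exchangeable.

Proof. By definition:

$$\begin{aligned}
p(y_1, \ldots, y_n) &= \int p(y_1, \ldots, y_n \mid \theta)\, \pi(\theta)\, d\theta \\
&= \int \left[ \prod_{i=1}^{n} p(y_i \mid \theta) \right] \pi(\theta)\, d\theta \\
&= \int \left[ \prod_{i=1}^{n} p(y_{\pi(i)} \mid \theta) \right] \pi(\theta)\, d\theta \\
&= p(y_{\pi(1)}, \ldots, y_{\pi(n)}).
\end{aligned}$$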
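The Result can be checked numerically for a small case. The sketch below is an illustration only: it assumes a hypothetical two-point prior on $\theta$ (mass 1/2 at each of 1/5 and 7/10) and three Bernoulli observations, and uses exact rational arithmetic so that equalities can be tested without rounding error.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical two-point prior on theta: value -> prior mass
prior = {Fraction(1, 5): Fraction(1, 2), Fraction(7, 10): Fraction(1, 2)}

def joint_pmf(ys):
    """Marginal Pr(y_1, ..., y_n) for conditionally iid Bernoulli(theta) outcomes."""
    total = Fraction(0)
    for theta, weight in prior.items():
        lik = Fraction(1)
        for y in ys:
            lik *= theta if y == 1 else 1 - theta
        total += weight * lik
    return total

# Exchangeability: the joint mass is unchanged under every permutation of the labels
exchangeable = all(
    joint_pmf(ys) == joint_pmf(tuple(ys[i] for i in perm))
    for ys in product((0, 1), repeat=3)
    for perm in permutations(range(3))
)

# Exchangeable but not independent: Pr(Y1 = 1, Y2 = 1) differs from Pr(Y1 = 1)^2
p1 = joint_pmf((1,))
p11 = joint_pmf((1, 1))
print(exchangeable, p11 == p1 * p1)
```

Here the joint mass is permutation invariant, yet the observations are marginally dependent because they share the unknown $\theta$.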


We  now  present  the  converse  of  this  result. 


Theorem. de Finetti’s representation theorem for 0/1 random variables.

If $Y_1, Y_2, \ldots$ is an infinitely exchangeable sequence of 0/1 random variables, there exists a distribution $\pi(\cdot)$ such that the joint mass function $\Pr(y_1, \ldots, y_n)$ has the form

$$\Pr(y_1, \ldots, y_n) = \int_0^1 \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}\, \pi(\theta)\, d\theta,$$

where

$$\int_0^{\theta} \pi(u)\, du = \lim_{n \to \infty} \Pr\left( \frac{Z_n}{n} \le \theta \right),$$

with $Z_n = Y_1 + \ldots + Y_n$, and $\theta = \lim_{n \to \infty} Z_n / n$.

Proof. The following is based on Bernardo and Smith (1994). Let $z_n = y_1 + \ldots + y_n$ be the number of 1's (which we label “successes”) in the first $n$ observations. Then, due to exchangeability,

$$\Pr(y_1 + \ldots + y_n = z_n) = \binom{n}{z_n} \Pr(y_{\pi(1)}, \ldots, y_{\pi(n)}),$$

for all permutations $\pi$ of $\{1, 2, \ldots, n\}$ such that $y_{\pi(1)} + \ldots + y_{\pi(n)} = z_n$. We can embed the event $y_1 + \ldots + y_n = z_n$ within a longer sequence and

$$\Pr(Y_1 + \ldots + Y_n = z_n) = \sum_{z_N = z_n}^{N - (n - z_n)} \Pr(z_n, z_N) = \sum_{z_N = z_n}^{N - (n - z_n)} \Pr(z_n \mid z_N) \Pr(z_N),$$

where $\Pr(z_N)$ is the “prior” belief in the number of successes out of $N$. To obtain the conditional probability, we observe that it is “as if” we have a population of $N$ items of which $z_N$ are successes and $N - z_N$ failures, from which we draw $n$ items. The distribution of $z_n \mid z_N$ is therefore hypergeometric so that

$$\Pr(y_1 + \ldots + y_n = z_n) = \sum_{z_N = z_n}^{N - (n - z_n)} \frac{\binom{z_N}{z_n} \binom{N - z_N}{n - z_n}}{\binom{N}{n}} \Pr(z_N).$$

We now let $\Pi(\theta)$ be the step function which is 0 for $\theta < 0$ and has jumps of $\Pr(z_N)$ at $\theta = z_N / N$, $z_N = 0, \ldots, N$. We now let $N \to \infty$. Then the hypergeometric distribution tends to a binomial distribution with parameters $n$ and $\theta$ and the prior $\Pr(z_N)$ is translated into a prior for $\theta$, which we write as $\pi(\theta)$. Consequently,

$$\Pr(y_1 + \ldots + y_n = z_n) \to \binom{n}{z_n} \int_0^1 \theta^{z_n} (1 - \theta)^{n - z_n}\, \pi(\theta)\, d\theta,$$

as $N \to \infty$. □
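The limiting step in the proof, in which the hypergeometric distribution tends to a binomial as $N \to \infty$ with $z_N / N$ held at $\theta$, can be illustrated numerically. The values $n = 5$, $z_n = 2$, and $\theta = 0.3$ below are arbitrary choices made for this sketch.

```python
from math import comb

def hypergeom_pmf(z_n, n, z_N, N):
    """Pr(z_n successes in n draws without replacement | z_N successes among N items)."""
    return comb(z_N, z_n) * comb(N - z_N, n - z_n) / comb(N, n)

def binom_pmf(z_n, n, theta):
    """Binomial probability of z_n successes in n trials."""
    return comb(n, z_n) * theta**z_n * (1 - theta) ** (n - z_n)

n, z_n = 5, 2
gaps = []
for N in (10, 100, 10_000):
    z_N = 3 * N // 10  # keeps z_N / N fixed at theta = 0.3
    gap = abs(hypergeom_pmf(z_n, n, z_N, N) - binom_pmf(z_n, n, 0.3))
    gaps.append(gap)
    print(N, gap)
```

The absolute difference between the two probabilities shrinks as $N$ grows, which is the behavior the proof relies upon.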

The implications of this theorem are of great significance. By the strong law of large numbers, $\theta = \lim_{n \to \infty} Z_n / n$, so that $\pi(\cdot)$ represents our beliefs about the limiting relative frequency of 1's. Hence, we have an interpretation of $\theta$. Further, we may view the $Y_i$ as conditionally independent Bernoulli random variables, conditional on the random variable $\theta$.

In conventional language, we have a likelihood function

$$\Pr(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \theta^{y_i} (1 - \theta)^{1 - y_i}$$

where the parameter $\theta$ is assigned a prior distribution $\pi(\theta)$.

In general, if $Y_1, Y_2, \ldots$ is an infinitely exchangeable sequence of random variables, there exists a probability density function $\pi(\cdot)$ such that

$$p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} p(y_i \mid \theta)\, \pi(\theta)\, d\theta, \qquad (3.35)$$

with $p(Y \mid \theta)$ denoting the density function corresponding to the “unknown parameter” $\theta$. A sketch proof of (3.35) may be found in Bernardo and Smith (1994). This result tells us that a conditional independence model can be justified via an exchangeability argument. In this general case, further assumptions on $Y_1, Y_2, \ldots$ are required to identify $p(Y \mid \theta)$. Bernardo and Smith (1994) present the assumptions that lead to a number of common modeling choices. For example, suppose that $Y_1, Y_2, \ldots$ is an infinitely exchangeable sequence of random variables such that $Y_i > 0$, $i = 1, 2, \ldots$. Further, suppose that for any event $A$ in $\mathbb{R} \times \ldots \times \mathbb{R}$, and for all $n$,

$$\Pr[(y_1, \ldots, y_n) \in A] = \Pr[(y_1, \ldots, y_n) \in A + a]$$

for all $a \in \mathbb{R} \times \ldots \times \mathbb{R}$ such that $a^{T} 1 = 0$ and $A + a$ is an event in $\mathbb{R} \times \ldots \times \mathbb{R}$. Then the joint density for $y_1, \ldots, y_n$ is

$$p(y_1, \ldots, y_n) = \int_0^{\infty} \prod_{i=1}^{n} \theta^{-1} \exp(-y_i / \theta)\, \pi(\theta)\, d\theta$$

where $\int_0^{\theta} \pi(u)\, du = \lim_{n \to \infty} \Pr(\bar{y}_n \le \theta)$ and $\bar{y}_n = (y_1 + \ldots + y_n)/n$. For a proof, see Diaconis and Ylvisaker (1980). Hence, a belief in exchangeability and a “lack of memory” property leads to a marginal distribution that is constructed by integrating the product of a conditionally independent set of exponential densities against a prior. The parameter is identified as the sample mean from a large number of observations.

This  kind  of  approach  is  of  theoretical  interest,  but  in  practice  the  choice 
of  likelihood  will  often  be  based  more  directly  on  the  context  and  previous 
experience  with  similar  data  types.  Exchangeability  is  very  useful  in  practice  for 
prior  specification,  however.  Before  one  uses  a  particular  conditional  independence 
model,  one  can  think  about  whether  all  units  are  deemed  exchangeable.  If  some 
collection  of  units  are  distinguishable,  then  one  should  not  assume  conditional 
independence  for  all  units,  and  one  may  instead  separate  the  units  into  groups  within 
which  exchangeability  holds.  For  further  discussion,  see  Sect.  8.6. 

In  terms  of  modeling,  if  we  believe  that  a  sequence  of  random  variables  is 
exchangeable,  this  allows  us  to  write  down  a  conditional  independence  model. 
We  emphasize  that  independence  is  a  very  different  assumption  since  it  implies  that 
we  learn  nothing  from  past  observations: 

$$p(y_{m+1}, \ldots, y_n \mid y_1, \ldots, y_m) = p(y_{m+1}, \ldots, y_n).$$

In a regression context, the situation is slightly more complicated. Informally,
exchangeability within covariate-defined groups gives the usual conditional
independence model, where we now condition on parameters and covariates; Bernardo
and Smith (1994, Sect. 4.64) contains details.


3.10  Hypothesis  Testing  with  Bayes  Factors 


We now turn to a description of Bayes factors, which are the conventional Bayesian
method for comparison of hypotheses/models. Let the observed data be denoted $y =
[y_1, \ldots, y_n]$, and assume two hypotheses of interest, $H_0$ and $H_1$. The application of
Bayes theorem gives the probability of the hypothesis $H_0$, given data $y$, as


$$\Pr(H_0 \mid y, H_0 \cup H_1) = \frac{p(y \mid H_0) \Pr(H_0 \mid H_0 \cup H_1)}{p(y \mid H_0 \cup H_1)}$$

where

$$p(y \mid H_0 \cup H_1) = p(y \mid H_0) \Pr(H_0 \mid H_0 \cup H_1) + p(y \mid H_1) \Pr(H_1 \mid H_0 \cup H_1)$$

is the probability of the data averaged over $H_0$ and $H_1$. The prior probability that $H_0$ is true, given one of $H_0$ and $H_1$ is true, is $\Pr(H_0 \mid H_0 \cup H_1)$, and $\Pr(H_1 \mid H_0 \cup H_1) = 1 - \Pr(H_0 \mid H_0 \cup H_1)$ is the prior on the alternative hypothesis. This simple calculation makes it clear that to evaluate the probability that the null is true, one is actually calculating the probability of the null given that $H_0$ or $H_1$ is true. Therefore, we are calculating the “relative truth”; $H_0$ may provide a poor fit to the data, but $H_1$ may be even worse. Although conditioning on $H_0 \cup H_1$ is crucial to interpretation, we will drop it for compactness of notation.

Table 3.3 Losses corresponding to the decision $\delta$, when the truth is $H$ and $L_I$ and $L_{II}$ are the losses associated with type I and II errors, respectively

    L(δ, H)            Decision
                       δ = 0      δ = 1
    Truth H    H_0     0          L_I
               H_1     L_II       0

If  we  wish  to  compare  models  H0  and  Hx,  then  a  natural  measure  is  given  by  the 
posterior  odds 

$$\frac{\Pr(H_0 \mid y)}{\Pr(H_1 \mid y)} = \frac{p(y \mid H_0)}{p(y \mid H_1)} \times \frac{\Pr(H_0)}{\Pr(H_1)}, \qquad (3.36)$$

where the Bayes factor

$$\mathrm{BF} = \frac{p(y \mid H_0)}{p(y \mid H_1)}$$

is the ratio of the marginal distributions of the data under the two models, and $\Pr(H_0)/\Pr(H_1)$ is the prior odds. Care is required in the choice of priors when Bayes factors are calculated; see Sect. 4.3.2 for further discussion.

Depending on the nature of the analysis, we may: simply report the Bayes factor; or we may place priors on the hypotheses and calculate the posterior odds of $H_0$; or we may go a step further and derive a decision rule. Suppose we pursue the latter and let $\delta = 0/1$ represent the decision to pick $H_0 / H_1$. With respect to Table 3.3, the posterior expected loss associated with decision $\delta$ is

$$E[L(\delta, H)] = L(\delta, H_0) \Pr(H_0 \mid y) + L(\delta, H_1) \Pr(H_1 \mid y)$$
so that for the two possible decisions (accept/reject $H_0$) the expected losses are

$$E[L(\delta = 0, H)] = 0 \times \Pr(H_0 \mid y) + L_{II} \times \Pr(H_1 \mid y)$$

$$E[L(\delta = 1, H)] = L_I \times \Pr(H_0 \mid y) + 0 \times \Pr(H_1 \mid y).$$

To find the decision that minimizes posterior expected cost, let $v = \Pr(H_1 \mid y)$ so that

$$E[L(\delta = 0, H)] = L_{II} \times v \qquad (3.37)$$

$$E[L(\delta = 1, H)] = L_I \times (1 - v). \qquad (3.38)$$

We should choose $\delta = 1$ if $L_{II} \times v > L_I (1 - v)$, that is, if $v/(1 - v) > L_I / L_{II}$, or $v > L_I / (L_I + L_{II})$. Hence, we report $H_1$ if

$$\Pr(H_1 \mid y) > \frac{L_I}{L_I + L_{II}} = \frac{1}{1 + L_{II}/L_I},$$


illustrating  that  we  only  need  to  specify  the  ratio  of  losses.  If  incorrect  decisions 
are  equally  costly,  we  should  therefore  report  the  hypothesis  that  has  the  greatest 
posterior  probability,  in  line  with  intuition.  These  calculations  can  clearly  be 
extended  to  three  or  more  hypotheses.  The  models  that  represent  each  hypothesis 
need  not  be  nested  as  with  likelihood  ratio  tests,  though  careful  prior  choice  is 
required  so  as  to  not  inadvertently  favor  one  model  over  another.  One  remedy  to 
this  difficulty  is  described  in  Sect.  6.16.3. 
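The decision rule derived above can be written down directly. The following is a minimal sketch with hypothetical function names; it converts a Bayes factor (defined, as in the text, with $H_0$ in the numerator) and prior odds into $\Pr(H_1 \mid y)$, and then picks the decision with the smaller posterior expected loss.

```python
def posterior_prob_h1(bayes_factor, prior_odds_h0):
    """Pr(H1 | y) from BF = p(y | H0) / p(y | H1) and prior odds Pr(H0) / Pr(H1)."""
    posterior_odds_h0 = bayes_factor * prior_odds_h0
    return 1.0 / (1.0 + posterior_odds_h0)

def choose_hypothesis(post_prob_h1, loss_type1, loss_type2):
    """Return 1 (report H1) or 0 (report H0) by minimizing posterior expected loss."""
    expected_loss_h0 = loss_type2 * post_prob_h1          # decision delta = 0
    expected_loss_h1 = loss_type1 * (1.0 - post_prob_h1)  # decision delta = 1
    return 1 if expected_loss_h1 < expected_loss_h0 else 0

def threshold(loss_type1, loss_type2):
    """Posterior probability of H1 above which H1 is reported."""
    return loss_type1 / (loss_type1 + loss_type2)

# Equal losses: report the hypothesis with the larger posterior probability
print(choose_hypothesis(0.6, 1.0, 1.0))
# A type I error four times as costly raises the threshold to 0.8
print(threshold(4.0, 1.0), choose_hypothesis(0.6, 4.0, 1.0))
```

Only the ratio of the two losses affects the decision, so scaling both losses by a common factor leaves the output unchanged.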

To evaluate the Bayes factor, we need to calculate the normalizing constants under $H_0$ and $H_1$. A generic normalizing constant is

$$p(y) = \int p(y \mid \theta)\, \pi(\theta)\, d\theta. \qquad (3.39)$$

We next derive a popular approximation to the Bayes factor. The integral (3.39) is an integral of the form (3.18) with

$$n h(\theta) = \log p(y \mid \theta) + \log \pi(\theta).$$

Letting $\tilde{\theta}$ denote the posterior mode, we may apply (3.20) with $n h(\theta) = \log p(y \mid \theta) + \log \pi(\theta)$ to give the Laplace approximation

$$\log p(y) \approx \log p(y \mid \tilde{\theta}) + \log \pi(\tilde{\theta}) + \frac{p}{2} \log 2\pi - \frac{p}{2} \log n + \frac{1}{2} \log |\hat{v}|.$$

As $n$ increases, the prior contribution will become negligible, and the posterior mode will be close to the MLE $\hat{\theta}$. Dropping terms of $O(1)$, we obtain the crude approximation

$$-2 \log p(y) \approx -2 \log p(y \mid \hat{\theta}) + p \log n.$$
Let hypothesis $H_j$ be indexed by parameters $\theta_j$ of length $p_j$ and $\hat{\theta}_j$ denote the MLEs for $j = 0, 1$. Without loss of generality, assume $p_0 < p_1$. We may approximate twice the log Bayes factor by

$$\begin{aligned}
2\left[\log p(y \mid H_0) - \log p(y \mid H_1)\right] &\approx 2\left[\log p(y \mid \hat{\theta}_0) - \log p(y \mid \hat{\theta}_1)\right] + (p_1 - p_0) \log n \\
&= 2\left[l(\hat{\theta}_0) - l(\hat{\theta}_1)\right] + (p_1 - p_0) \log n \qquad (3.40)
\end{aligned}$$

which is the log-likelihood ratio statistic (see Sect. 2.9.5) with the addition of
a term that penalizes complexity; (3.40) is known as the Bayesian information
criterion (BIC). The Schwarz criterion (Schwarz 1978) is the BIC divided by 2. If
the maximized likelihoods are approximately equal, then model $H_0$ is preferred
if $p_0 < p_1$, as it contains fewer parameters. As $n$ increases, the penalty term
increases  in  size  showing  the  difference  in  behavior  with  frequentist  tests  in  which 
significance  levels  are  often  kept  constant  with  respect  to  sample  size.  A  more 
detailed  comparison  of  Bayesian  and  frequentist  approaches  to  hypothesis  testing 
will  be  carried  out  in  Chap.  4. 
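The role of the sample size in (3.40) can be seen in a short calculation. The log-likelihood values, dimensions, and sample sizes below are hypothetical numbers chosen for illustration; the function name is likewise an assumption of this sketch.

```python
import math

def approx_two_log_bf(loglik0, loglik1, p0, p1, n):
    """Approximate 2 * log BF (H0 versus H1) using the BIC form (3.40)."""
    return 2.0 * (loglik0 - loglik1) + (p1 - p0) * math.log(n)

# Hypothetical maximized log-likelihoods for nested models with p0 = 1 and p1 = 2
loglik0, loglik1 = -520.0, -517.5

# The likelihood ratio statistic is 2 * (loglik1 - loglik0) = 5 in both cases,
# but the penalty (p1 - p0) * log(n) grows with the sample size
small_n = approx_two_log_bf(loglik0, loglik1, 1, 2, n=100)
large_n = approx_two_log_bf(loglik0, loglik1, 1, 2, n=10_000)
print(small_n, large_n)
```

With the same likelihood ratio statistic, the approximation is negative (favoring $H_1$) at the smaller sample size and positive (favoring $H_0$) at the larger one, whereas a test with a fixed significance level would reach the same conclusion in both cases.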


3.11  Bayesian  Inference  Based  on  a  Sampling  Distribution 

We  now  describe  an  approach  to  Bayesian  inference  which  is  pragmatic  and 
computationally  simple  and  allows  frequentist  summaries  to  be  embedded  within 
a  Bayesian  framework.  This  is  useful  in  situations  in  which  one  would  like  to 
examine  the  impact  of  prior  specification.  It  is  also  appealing  to  examine  frequentist 
procedures  with  no  formal  Bayesian  justification  from  a  Bayesian  slant.  Suppose 
we  are  in  a  situation  in  which  the  sample  size  n  is  sufficiently  large  for  accurate 
asymptotic inference and suppose we have a parameter $\theta$ of length $p$. The sampling distribution of the estimator is

$$\hat{\theta}_n \mid \theta \sim N_p(\theta, V_n),$$

where $V_n$ is assumed known. The notation here is sloppy; it would be more accurate to state the distribution as

$$V_n^{-1/2} (\hat{\theta}_n - \theta) \sim N_p(0, I).$$

Appealing to conjugacy, it is then convenient to combine this “likelihood” with the prior $\theta \sim N_p(m, W)$ to give the posterior

$$\theta \mid \hat{\theta}_n \sim N_p(m_n^{\star}, W_n^{\star}) \qquad (3.41)$$

where

$$W_n^{\star} = (W^{-1} + V_n^{-1})^{-1}$$

$$m_n^{\star} = W_n^{\star} (W^{-1} m + V_n^{-1} \hat{\theta}_n).$$

The posterior distribution is therefore easy to determine since we only require a point estimate $\hat{\theta}_n$, with an associated variance-covariance matrix, and specification of the prior mean and variance-covariance matrix.
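The update in (3.41) requires only a few matrix operations. The sketch below assumes NumPy is available; the function name and the numerical inputs (estimate, variance-covariance matrix, and prior) are hypothetical values chosen for illustration.

```python
import numpy as np

def normal_posterior(theta_hat, V, m, W):
    """Posterior mean and covariance from theta_hat | theta ~ N(theta, V), theta ~ N(m, W)."""
    V_inv = np.linalg.inv(V)
    W_inv = np.linalg.inv(W)
    W_star = np.linalg.inv(W_inv + V_inv)
    m_star = W_star @ (W_inv @ m + V_inv @ theta_hat)
    return m_star, W_star

# Hypothetical estimate with its variance-covariance matrix, and a N(0, I) prior
theta_hat = np.array([2.0, -0.5])
V = np.array([[0.04, 0.0], [0.0, 0.01]])
m = np.zeros(2)
W = np.eye(2)

m_star, W_star = normal_posterior(theta_hat, V, m, W)
print(m_star)
print(W_star)
```

The posterior mean is a precision-weighted average of the prior mean and the estimate, so an estimate with a small variance relative to the prior is shrunk only slightly.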

An  even  more  straightforward  approach,  when  a  single  parameter  is  of  interest,  is 
to  ignore  the  remaining  nuisance  parameters  and  focus  only  on  this  single  estimate 
and  standard  error.  There  are  a  number  of  advantages  to  this  approach,  not  least 
of  which  is  the  removal  of  the  need  for  prior  specification  over  the  nuisance 
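parameters. Let $\theta$ denote the parameter of interest and $\alpha$ the ($p \times 1$) vector of nuisance parameters. Following Wakefield (2009a), we give a derivation beginning with the asymptotic distribution (we drop the explicit dependence on $n$ for notational convenience):

$$\begin{bmatrix} \hat{\alpha} \\ \hat{\theta} \end{bmatrix} \sim N_{p+1}\left( \begin{bmatrix} \alpha \\ \theta \end{bmatrix}, \begin{bmatrix} I_{00} & I_{01} \\ I_{01}^{T} & I_{11} \end{bmatrix}^{-1} \right) \qquad (3.42)$$

where $I_{00}$ is the $p \times p$ expected information matrix for $\alpha$, $I_{11}$ is the information concerning $\theta$, and $I_{01}$ is the $p \times 1$ vector of cross terms. We now reparameterize the model and consider $(\alpha, \theta) \to (\gamma, \theta)$ where

$$\gamma = \alpha + \frac{I_{01}}{I_{00}} \theta$$

which yields

$$\begin{bmatrix} \hat{\gamma} \\ \hat{\theta} \end{bmatrix} \sim N_{p+1}\left( \begin{bmatrix} \gamma \\ \theta \end{bmatrix}, \begin{bmatrix} I_{00}^{-1} & 0 \\ 0^{T} & V \end{bmatrix} \right) \qquad (3.43)$$

where $\hat{\gamma} = \hat{\alpha} + (I_{01}/I_{00}) \hat{\theta}$, $V = (I_{11} - I_{01}^{T} I_{00}^{-1} I_{01})^{-1}$ is the asymptotic variance of $\hat{\theta}$, and $0$ is a $p \times 1$ vector of zeros. Hence, asymptotically,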
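the “likelihood” factors into independent pieces

$$p(\hat{\gamma}, \hat{\theta} \mid \gamma, \theta) = p(\hat{\gamma} \mid \gamma) \times p(\hat{\theta} \mid \theta).$$

We now assume independent priors on $\gamma$ and $\theta$, $\pi(\gamma, \theta) = \pi(\gamma) \pi(\theta)$, to give

$$\begin{aligned}
p(\gamma, \theta \mid \hat{\gamma}, \hat{\theta}) &\propto p(\hat{\gamma} \mid \gamma) \pi(\gamma)\, p(\hat{\theta} \mid \theta) \pi(\theta) \\
&\propto p(\gamma \mid \hat{\gamma})\, p(\theta \mid \hat{\theta})
\end{aligned}$$

so that the posterior factors also and we can concentrate on $p(\theta \mid \hat{\theta})$ alone. The simple model

$$\hat{\theta} \mid \theta \sim N(\theta, V)$$

$$\theta \sim N(m, W)$$

therefore results in the posterior

$$\theta \mid \hat{\theta} \sim N\left( (W^{-1} + V^{-1})^{-1} (W^{-1} m + V^{-1} \hat{\theta}),\ (W^{-1} + V^{-1})^{-1} \right). \qquad (3.44)$$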


The above approach is similar to the “null orthogonality” reparameterization of Kass and Vaidyanathan (1992). The reparameterization is also that which is used when the linear model

$$Y_i = \alpha + x_i \theta + \epsilon_i$$

is written as

$$Y_i = \gamma + (x_i - \bar{x}) \theta + \epsilon_i$$

which, of course, yields uncorrelated least squares estimators $\hat{\gamma}, \hat{\theta}$. The reparameterization trick works because of the assumption of independent priors on $\gamma$ and $\theta$ which, of course, does not imply independent priors on $\alpha$ and $\theta$. However, we emphasize that we do not need to explicitly specify priors on $\gamma$, because the terms involving $\gamma$ cancel in the calculation.
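The effect of centering the covariate in the linear model can be verified directly, since the covariance of the least squares estimators is proportional to $(X^{T} X)^{-1}$. The covariate values below are hypothetical and NumPy is assumed to be available.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical covariate values

def ls_covariance_shape(design):
    """Return (X^T X)^{-1}, proportional to the covariance of the least squares estimators."""
    return np.linalg.inv(design.T @ design)

# Original parameterization (alpha, theta) and centered parameterization (gamma, theta)
X_original = np.column_stack([np.ones_like(x), x])
X_centered = np.column_stack([np.ones_like(x), x - x.mean()])

cov_original = ls_covariance_shape(X_original)
cov_centered = ls_covariance_shape(X_centered)
print(cov_original[0, 1], cov_centered[0, 1])
```

The off-diagonal element is nonzero in the original parameterization and vanishes after centering, which is the orthogonality exploited in the derivation above.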

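Bayes factors can also be simply evaluated under either of the approximations, (3.41) or (3.44). To illustrate for the latter, suppose $\theta$ is univariate, and we wish to compare the hypotheses

$$H_0 : \theta = 0, \qquad H_1 : \theta \ne 0,$$

with the prior under the alternative, $\theta \sim N(0, W)$. The Bayes factor is

$$\mathrm{BF} = \frac{p(\hat{\theta} \mid \theta = 0)}{\int p(\hat{\theta} \mid \theta)\, \pi(\theta)\, d\theta} = \sqrt{\frac{V + W}{V}} \exp\left( -\frac{1}{2} \frac{\hat{\theta}^2}{V} \frac{W}{V + W} \right). \qquad (3.45)$$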


This approach allows a Bayesian interpretation of published results, since all that
is required for calculation of (3.45) is $\hat{\theta}$ and $V$, which may be derived from a
confidence interval or the estimate with its associated standard error.
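As an illustration of how (3.45) might be applied to a published summary, the sketch below recovers $\hat{\theta}$ and $V$ from a Wald-type confidence interval and evaluates the Bayes factor. The interval endpoints (for a relative risk, analyzed on the log scale) and the prior variance $W$ are hypothetical values, and the function names are assumptions of this sketch.

```python
import math
from statistics import NormalDist

def estimate_from_ci(lower, upper, level=0.95):
    """Recover (theta_hat, V) from a symmetric Wald interval on the analysis scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    theta_hat = 0.5 * (lower + upper)
    V = ((upper - lower) / (2.0 * z)) ** 2
    return theta_hat, V

def approx_bayes_factor(theta_hat, V, W):
    """Bayes factor (3.45) for H0: theta = 0 versus H1: theta ~ N(0, W)."""
    shrink = W / (V + W)
    return math.sqrt((V + W) / V) * math.exp(-0.5 * theta_hat**2 / V * shrink)

# Hypothetical published 95% interval for a relative risk, moved to the log scale
theta_hat, V = estimate_from_ci(math.log(1.05), math.log(1.60))
W = 0.2**2  # hypothetical prior variance for the log relative risk

bf = approx_bayes_factor(theta_hat, V, W)
print(theta_hat, V, bf)
```

A value of the Bayes factor below 1 indicates that the data favor $H_1$ over $H_0$ under the assumed prior; repeating the calculation over a range of $W$ gives a simple check of sensitivity to the prior.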

More  controversially,  an  advantage  of  the  use  of  the  asymptotic  distribution  of 
the  MLE  only  is  that  the  Bayes  factor  calculation  may  be  based  on  nonstandard 
likelihoods or estimating functions which do not have formal Bayesian justifications.
For example, the estimate and standard error may arise from conditional or
marginal  likelihoods  (as  described  in  Sect.  2.4.2),  or  using  sandwich  estimates  of 
the  variance.  As  discussed  in  Chap.  2,  a  strength  of  modern  frequentist  methods 
based  on  estimating  functions  is  that  estimators  are  produced  that  are  consistent 
under  much  milder  assumptions  than  were  used  to  derive  the  estimators  (e.g.,  the 
estimator  may  be  based  on  a  score  equation,  but  the  variance  estimate  may  not 
require  the  likelihood  to  be  correctly  specified).  The  use  of  a  consistent  variance 
estimate  with  (3.45)  allows  the  benefits  of  frequentist  sandwich  estimation  and 
Bayesian prior specification to be combined. Bayesian hypothesis testing may also
be  based  on  frequentist  summaries.  Exercises  3.10  and  3.11  give  further  details  on 
the  approach  described  in  this  section,  including  the  extension  to  having  estimators 
and  standard  errors  from  multiple  studies. 


3.12  Concluding  Remarks 

Bayesian analyses should not be restricted to convenient likelihoods and
likelihood/prior combinations; this is especially true with the advent of modern
computational  approaches.  However,  one  still  needs  to  be  careful  that  the  sampling 
scheme  (i.e.,  the  design)  is  acknowledged  by  the  likelihood  specification  and  that 
the  likelihood/prior  combination  leads  to  a  proper  posterior. 

We  now  follow  up  on  Sect.  1.6  and  describe  situations  in  which  frequentist 
and  Bayesian  methods  are  likely  to  agree  and  when  one  is  preferable  over  the 
other.  We  concentrate  on  estimation  since  point  and  interval  estimation  are  directly 
comparable  under  the  two  paradigms.  For  model  comparison,  the  objectives  of 
Bayes  factors  and  hypothesis  tests  are  fundamentally  different  (see,  e.g.,  Berger 
(2003)),  and  so  comparison  is  more  difficult.  Chapter  4  compares  and  critiques 
frequentist  and  Bayesian  approaches  to  hypothesis  testing. 

On  a  philosophical  level,  the  Bayesian  approach  is  satisfying  since  one  simply 
follows  the  rules  of  probability  as  applied  to  the  unknowns  whether  they  be 
parameters  or  hypotheses.  This  is  in  stark  contrast  to  the  frequentist  approach  in 
which  the  parameters  are  fixed.  Consequently,  credible  intervals  are  probabilistic 
and  easily  interpretable,  and  posterior  distributions  on  parameters  of  interest  are 
obtained  through  marginalization.  Another  appealing  characteristic  is  that  the 
Bayesian  approach  to  inference  may  be  formally  derived  via  decision  theory;  see, 
for  example,  Bernardo  and  Smith  (1994).  A  concept  that  has  received  a  lot  of 
discussion  is  the  likelihood  principle  (Berger  and  Wolpert  1988;  Royall  1997) 
which  states  that  the  likelihood  function  contains  all  relevant  information.  So  two 
sets  of  data  with  proportional  likelihoods  should  lead  to  the  same  conclusion.  The 
likelihood  principle  leads  one  toward  a  Bayesian  approach  since  all  frequentist 
criteria violate this principle, and a true likelihood approach as followed by,
for  example,  Royall  (1997)  is  difficult  to  calibrate.  The  likelihood  principle  is  a 
cornerstone  of  many  Bayesian  developments,  but  in  this  book  we  follow  a  far  more 
pragmatic  approach  and  so  do  not  provide  further  details  on  this  topic. 

In  contrast,  the  frequentist  approach  is  more  difficult  to  justify  on  philosophical 
grounds.  Instead,  much  theory  has  been  developed  in  terms  of  optimality  within 
a  frequentist  set  of  guidelines.  For  example,  as  discussed  in  Sect.  2.8,  there  is  a 
Gauss-Markov  theorem  for  linear  estimating  functions  (Godambe  and  Heyde  1987; 
McCullagh  1983),  while  Crowder  (1987)  considers  the  optimality  of  quadratic 
estimating  functions. 

We  have  seen  that,  so  long  as  the  prior  does  not  exclude  regions  of  the  parameter 
space,  Bayesian  estimators  have  similar  frequentist  properties  to  MLEs.  The  greatest 
drawback of the Bayesian approach is the need to specify both a likelihood and
a  prior  distribution.  Sensitivity  to  each  of  these  components  can  be  examined, 
but  carrying  out  such  an  endeavor  in  practice  is  difficult  and  one  is  then  faced 
with  the  difficulty  of  how  results  should  be  reported.  The  frequentist  approach  to 
model  misspecification  is  quite  different,  and  the  use  of  sandwich  estimation  to 
give  a  consistent  standard  error  is  very  appealing.  There  is  no  Bayesian  approach 
analogous  to  sandwich  estimation,  but  see  Szpiro  et  al.  (2010)  for  some  progress  on 
a  Bayesian  justification  of  sandwich  estimation. 

For  small  n,  Bayesian  methods  are  desirable;  in  an  extreme  case  if  the  number  of 
parameters  exceeds  n,  then  a  Bayesian  approach  (or  some  form  of  penalization,  see 
Chaps.  10-12)  must  be  followed.  In  this  situation  there  is  no  way  that  the  likelihood 
can  be  checked  and  inference  will  be  sensitive  to  both  likelihood  and  prior  choices. 
When  the  model  is  very  complex,  then  Bayesian  methods  are  again  advantageous 
since  they  allow  a  rigorous  treatment  of  nuisance  parameters;  MCMC  has  allowed 
the  consideration  of  more  and  more  complicated  hierarchical  models,  for  example. 
Spatial  models,  particularly  those  that  exploit  Markov  random  field  second  stages, 
provide  a  good  example  of  models  that  are  very  naturally  analyzed  using  MCMC 
or  INLA,  where  the  conditional  independencies  may  be  exploited;  see  Sect.  9.7 
for  an  illustrative  example.  Unfortunately,  assessments  of  the  effects  of  model 
misspecification  are  difficult  for  such  complex  models;  instead  sensitivity  studies 
are  again  typically  carried  out.  Consistency  results  under  model  misspecification 
are  difficult  to  come  by  for  complex  models  (such  as  those  discussed  in  Chap.  9). 
Bayesian  methods  are  also  appealing  in  situations  in  which  the  maximum  likelihood 
estimator  provides  a  poor  summary  of  the  likelihood,  for  example,  in  variance 
components  problems. 

If  n  is  sufficiently  large  for  asymptotic  normality  of  the  sampling  distribution  to 
be  accurate,  then  frequentist  methods  have  advantages  over  Bayesian  alternatives. 
In  particular,  as  just  mentioned,  sandwich  estimation  can  be  used  to  provide  a 
consistent  estimator  of  the  variance-covariance  matrix  of  the  estimator.  Hence,  if 
the  estimator  is  consistent,  reliable  confidence  coverage  will  be  guaranteed.  We 
stress  that  n  needs  to  be  sufficiently  large  for  the  sandwich  estimator  to  be  stable. 
A  typical  Bayesian  approach  would  be  to  increase  model  complexity,  often  through 
the  introduction  of  random  effects.  The  difficulty  with  this  is  that  although  more 
flexibility  is  achieved,  a  specific  form  needs  to  be  assumed  for  the  mean-variance 
relationship,  in  contrast  to  sandwich  estimation. 

We  briefly  mention  two  topics  which  have  not  been  discussed  in  this  chapter. 
The  linear  Bayesian  method  (Goldstein  and  Wooff  2007)  is  an  appealing  approach 
in  which  Bayesian  inference  is  carried  out  on  the  basis  of  expectation  rather  than 
probability. The appeal comes from removing the need to specify complete prior distributions; only the means and variances of the parameters require specification. The deviance information criterion (DIC) is a popular approach for
comparison  of  models  that  was  introduced  by  Spiegelhalter  et  al.  (1998).  The 
method  is  controversial,  however,  as  the  discussion  of  the  aforementioned  paper 
makes clear; see also Plummer (2008).

3.13 Bibliographic Notes

Bayes’  original  paper  was  published  posthumously  as  Bayes  (1763).  The  book  by 
Jeffreys  was  highly  influential:  the  original  edition  was  published  in  1939  and  the 
third  edition  as  Jeffreys  (1961).  Other  influential  works  include  Savage  (1972)  and 
translations  of  de  Finetti’s  books,  De  Finetti  (1974,  1975). 

Bernardo and Smith (1994) provide a thorough description of the decision-theoretic justification of the Bayesian approach. O’Hagan and Forster (2004) give a good overview of Bayesian methodology, and Gelman et al. (2004) and Carlin and Louis (2009) give descriptions with a more practical flavor. Robert (2001) provides a decision-theoretic approach. Hoff (2009) is an excellent introductory text.

Approaches to addressing the sensitivity of inference to different prior choices are described in O’Hagan (1994, Chap. 7). A good overview of methods for
integration  is  provided  by  Evans  and  Swartz  (2000).  Lindley  (1980),  Tierney  and 
Kadane  (1986),  and  Kass  et  al.  (1990)  provide  details  of  the  Laplace  method  in 
a  Bayesian  context.  Devroye  (1986)  provides  an  excellent  and  detailed  overview 
of  random  variate  generation.  Smith  and  Gelfand  (1992)  emphasize  the  duality 
between  samples  and  densities  and  illustrate  the  use  of  simple  rejection  algorithms 
in a Bayesian context. Gamerman and Lopes (2006) provide an introduction to MCMC; an up-to-date summary may be found in Brooks et al. (2011).
Computational  techniques  that  have  not  been  discussed  include  reversible  jump 
Markov  chain  Monte  Carlo  (Green  1995)  which  may  be  used  when  the  parameter 
space  changes  dimension  across  models,  variational  approximations  (Jordan  et  al. 
1999;  Ormerod  and  Wand  2010),  and  approximate  Bayesian  computation  (ABC) 
(Beaumont  et  al.  2002;  Fearnhead  and  Prangle  2012).  Kass  and  Raftery  (1995)  give 
a  review  of  Bayes  factors,  including  a  discussion  of  computation  and  prior  choice. 
Johnson  (2008)  discusses  the  use  of  Bayes  factors  based  on  summary  statistics. 


3.14  Exercises 

3.1  Derive  the  posterior  mean  and  posterior  quantiles  as  the  solution  to  quadratic 
and  linear  loss,  respectively,  as  described  in  Sect.  3.2. 

3.2 Consider a random sample $Y_i \mid \theta \sim_{iid} N(\theta, \sigma^2)$, $i = 1, \ldots, n$, with $\theta$ unknown and $\sigma^2$ known.

(a)  By  writing  the  likelihood  in  exponential  family  form,  obtain  the  conjugate 
prior  and  hence  the  posterior  distribution. 

(b) Using the conjugate formulation, derive the predictive distribution for a new univariate observation $Z$ from $N(\theta, \sigma^2)$, assumed conditionally independent of $Y_1, \ldots, Y_n$.

3.3 Consider the Neyman-Scott problem in which $Y_{ij} \mid \mu_i, \sigma^2 \sim_{ind} N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$, $j = 1, 2$.

Table 3.4 Case-control data: Y = 1 corresponds to the event of esophageal cancer, and X = 1 exposure to greater than 80 g of alcohol per day

         X = 0   X = 1   Total
Y = 1      104      96     200
Y = 0      666     109     775

(a) Show that Jeffreys prior in this case is

$$\pi(\mu_1, \ldots, \mu_n, \sigma^2) \propto \sigma^{-(n+2)}.$$


(b)  Derive  the  posterior  distribution  corresponding  to  this  prior  and  show  that 


(c) Hence, using Exercise 2.6, show that $E[\sigma^2 \mid y] \rightarrow \sigma^2/2$ as $n \rightarrow \infty$, so that the posterior mean is inconsistent.

(d) Examine the posterior distribution corresponding to the prior

$$\pi(\mu_1, \ldots, \mu_n, \sigma^2) \propto \sigma^{-2}.$$


(e) Is the posterior mean for $\sigma^2$ consistent in this case?

3.4  Consider  the  data  given  in  Table  3.4,  which  are  a  simplified  version  of 
those  reported  in  Breslow  and  Day  (1980).  These  data  arose  from  a  case- 
control  study  (Sect.  7.10)  that  was  carried  out  to  investigate  the  relationship 
between  esophageal  cancer  and  various  risk  factors.  There  are  200  cases  and 
775  controls.  Disease  status  is  denoted  Y  with  Y  =  0/1  corresponding 
to  without/with  disease,  and  alcohol  consumption  is  represented  by  X  with 
X  =  0/1  denoting  <  80  g/  >  80  g  on  average  per  day.  Let  the  probabilities 
of  high  alcohol  consumption  in  the  cases  and  controls  be  denoted 


$$p_1 = \Pr(X = 1 \mid Y = 1) \quad \text{and} \quad p_2 = \Pr(X = 1 \mid Y = 0),$$

respectively. Further, let $X_1$ be the number exposed from $n_1$ cases and $X_2$ be the number exposed from $n_2$ controls. Suppose $X_i \mid p_i \sim \text{Binomial}(n_i, p_i)$ in the case ($i = 1$) and control ($i = 2$) groups.

(a) Of particular interest in studies such as this is the odds ratio defined by

$$\theta = \frac{\Pr(Y = 1 \mid X = 1)/\Pr(Y = 0 \mid X = 1)}{\Pr(Y = 1 \mid X = 0)/\Pr(Y = 0 \mid X = 0)}.$$

Show that the odds ratio is equal to

$$\theta = \frac{\Pr(X = 1 \mid Y = 1)/\Pr(X = 0 \mid Y = 1)}{\Pr(X = 1 \mid Y = 0)/\Pr(X = 0 \mid Y = 0)} = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}.$$

(b) Obtain the MLE and a 90% confidence interval for $\theta$, for the data of Table 3.4.

(c) We now consider a Bayesian analysis. Assume that the prior distribution for $p_i$ is the beta distribution $\text{Be}(a, b)$ for $i = 1, 2$. Show that the posterior distribution $p_i \mid x_i$ is given by the beta distribution $\text{Be}(a + x_i, b + n_i - x_i)$, $i = 1, 2$.

(d)  Consider  the  case  a  =  b  =  1.  Obtain  expressions  for  the  posterior  mean, 
mode,  and  standard  deviation.  Evaluate  these  posterior  summaries  for  the 
data of Table 3.4. Report 90% posterior credible intervals for $p_1$ and $p_2$.

(e)  Obtain  the  asymptotic  form  of  the  posterior  distribution  and  obtain  90% 
credible intervals for $p_1$ and $p_2$. Compare this interval with the exact
calculation  of  the  previous  part. 

(f) Simulate samples $p_1^{(t)}, p_2^{(t)}$, $t = 1, \ldots, T = 1{,}000$, from the posterior distributions $p_1 \mid x_1$ and $p_2 \mid x_2$. Form histogram representations of the posterior distributions using these samples, and obtain sample-based 90% credible intervals.

(g) Obtain samples from the posterior distribution of $\theta \mid x_1, x_2$ and provide a histogram representation of the posterior. Obtain the posterior median and 90% credible interval for $\theta \mid x_1, x_2$ and compare with the likelihood analysis.

(h)  Suppose  the  rate  of  esophageal  cancer  is  17  in  100,000.  Describe  how  this 
information  may  be  used  to  evaluate 


$$q_1 = \Pr(Y = 1 \mid X = 1) \quad \text{and} \quad q_0 = \Pr(Y = 1 \mid X = 0).$$
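Parts (c), (d), (f), and (g) of Exercise 3.4 can be sketched numerically as follows. This is a minimal illustration rather than a model solution: it assumes the uniform Be(1, 1) prior of part (d), uses Python's standard `random` module, and all function and variable names (`beta_posterior_summaries`, `interval`, `theta_mle`) are hypothetical.

```python
import random
import statistics

# Table 3.4: exposed counts and group sizes for cases (1) and controls (2).
x1, n1 = 96, 200
x2, n2 = 109, 775
a = b = 1.0  # uniform Be(1, 1) prior, as in part (d)


def beta_posterior_summaries(x, n, a, b):
    """Mean, mode, and s.d. of the Be(a + x, b + n - x) posterior."""
    a_post, b_post = a + x, b + n - x
    total = a_post + b_post
    mean = a_post / total
    mode = (a_post - 1) / (total - 2)  # valid when a_post, b_post > 1
    var = a_post * b_post / (total ** 2 * (total + 1))
    return mean, mode, var ** 0.5


def interval(samples, level=0.90):
    """Equal-tailed sample-based interval from sorted draws."""
    s = sorted(samples)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo, hi


rng = random.Random(1)
T = 1000
p1 = [rng.betavariate(a + x1, b + n1 - x1) for _ in range(T)]
p2 = [rng.betavariate(a + x2, b + n2 - x2) for _ in range(T)]

# Each pair of draws gives one draw of the odds ratio theta.
theta = [(u / (1 - u)) / (v / (1 - v)) for u, v in zip(p1, p2)]

theta_mle = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
print("MLE of theta:", theta_mle)
print("posterior median of theta:", statistics.median(theta))
print("90% interval for theta:", interval(theta))
```

Because the two posteriors are independent, transforming paired draws of $p_1$ and $p_2$ yields draws from the posterior of $\theta$ with no further calculation.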


3.5 Prove that if global balance, as given by (3.31), holds then $\pi(\cdot)$ is the invariant distribution, that is,


for  all  measurable  sets  A. 

3.6 Prove that the Metropolis-Hastings algorithm, defined through (3.33), has invariant distribution $\pi(\cdot)$, by showing that detailed balance (3.31) holds.

3.7 We consider the data described in the example at the end of Sect. 3.7.7 concerning the leukemia count, $Y$, assumed to follow a Poisson distribution with mean $E \times \theta$. Consider the $y = 4$ observed leukemia cases in Seascale, with expected number of cases $E = 0.25$. Previously in this chapter, a lognormal prior was assumed for $\theta$. In this exercise, a conjugate gamma prior will be used.

(a) Show that with a $\text{Ga}(a, b)$ prior, the posterior distribution for $\theta$ is a
gamma  distribution  also.  Hence,  determine  the  posterior  mean,  mode, 
and  variance.  Show  that  the  posterior  mean  can  be  written  as  a  weighted 
combination  of  the  MLE  and  the  prior  mean.  Similarly  write  the  posterior 
mode  as  a  weighted  combination  of  the  MLE  and  the  prior  mode. 

(b)  Determine  the  form  of  the  prior  predictive  Pr(y)  and  show  that  it 
corresponds  to  a  negative  binomial  distribution. 

(c) Obtain the predictive distribution $\Pr(z \mid y)$ for the number of cases $z$ in a future period of time with expected number of cases $E^{*}$.

(d)  Obtain  the  posterior  distribution  under  gamma  prior  distributions  with 
parameters  a  =  b  =  0.1,  a  =  b  =  1.0,  and  a  =  b  =  10.  Determine 
the  5%,  50%,  and  95%  posterior  quantiles  in  each  case  and  comment  on 
the  sensitivity  to  the  prior. 
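Parts (a) and (d) of Exercise 3.7 can be checked numerically with a short sketch. It assumes the standard conjugate result that a Ga(a, b) prior (shape a, rate b) combined with a Poisson count y with mean $E \times \theta$ gives a Ga(a + y, b + E) posterior; quantiles are approximated by simulation so that only the standard library is needed. Function names are hypothetical.

```python
import random

y, E = 4, 0.25  # observed and expected leukemia counts in Seascale


def posterior_params(a, b):
    """Shape and rate of the gamma posterior under a Ga(a, b) prior."""
    return a + y, b + E


def posterior_mean_as_weighted_average(a, b):
    """Posterior mean written as w * MLE + (1 - w) * prior mean, with w = E / (b + E)."""
    w = E / (b + E)
    return w * (y / E) + (1 - w) * (a / b)


def posterior_quantiles(a, b, probs=(0.05, 0.5, 0.95), draws=20000, seed=1):
    """Monte Carlo approximation to posterior quantiles."""
    shape, rate = posterior_params(a, b)
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    s = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(draws))
    return [s[min(int(p * draws), draws - 1)] for p in probs]


priors = [(0.1, 0.1), (1.0, 1.0), (10.0, 10.0)]
summaries = {(a, b): posterior_quantiles(a, b) for a, b in priors}
for prior, q in summaries.items():
    print(prior, [round(v, 2) for v in q])
```

All three priors have mean 1 but very different variances, so comparing the three rows gives a direct view of the prior sensitivity that part (d) asks about.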

3.8 Consider a situation in which the likelihood may be summarized as

$$\sqrt{n}(\overline{Y}_n - \mu) \rightarrow_d N(0, \sigma^2),$$

where $\overline{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$, with $\sigma^2$ known, and the prior for $\mu$ is the Cauchy distribution with parameters 0 and 1, that is,

$$p(\mu) = \frac{1}{\pi(1 + \mu^2)}, \quad -\infty < \mu < \infty.$$

We label this likelihood-prior combination as model $M_c$.

(a)  Describe  a  rejection  algorithm  for  obtaining  samples  from  the  posterior 
distribution,  with  the  proposal  density  taken  as  the  prior. 

(b) Implement the rejection algorithm for the case in which $\bar{y} = 0.2$, $\sigma^2 = 2$, and $n = 10$. Provide a histogram representation of the posterior, and evaluate the posterior mean and variance. Also obtain an estimate of the normalizing constant, $p(y \mid M_c)$.

(c) Describe an importance sampling algorithm for evaluating $p(y \mid M_c)$, $E[\mu \mid y, M_c]$, and $\text{var}(\mu \mid y, M_c)$.

(d) For the data of part (b), implement the importance sampling algorithm, and calculate $p(y \mid M_c)$, $E[\mu \mid y, M_c]$, and $\text{var}(\mu \mid y, M_c)$.

(e) Now assume that the prior for $\mu$ is the normal distribution $N(0, 0.4)$. Denote this model $M_n$. Obtain the form of the posterior distribution in this case.

(f) For the data of part (b), obtain the normalizing constant $p(y \mid M_n)$ and the posterior mean and variance. Compare these summaries with those obtained under the Cauchy prior. Interpret the ratio

$$\frac{p(y \mid M_n)}{p(y \mid M_c)},$$

that is, the Bayes factor, for these data.
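The rejection algorithm of Exercise 3.8, parts (a) and (b), might be sketched as below. The sketch assumes that the data enter only through the sample mean, with $\bar{y} \mid \mu \sim N(\mu, \sigma^2/n)$, and that the value 0.2 in part (b) is the sample mean. Proposals are drawn from the Cauchy prior and accepted with probability equal to the likelihood divided by its maximum; the names `likelihood`, `rejection_sample`, and `norm_const` are illustrative.

```python
import math
import random

# Data summary from part (b): sample mean, known variance, sample size.
ybar, sigma2, n = 0.2, 2.0, 10
v = sigma2 / n  # variance of the sample mean under the normal summary


def likelihood(mu):
    """Normal density of ybar given mu, with variance sigma2 / n."""
    return math.exp(-(ybar - mu) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)


M = likelihood(ybar)  # the likelihood is maximized at mu = ybar


def rejection_sample(draws, seed=1):
    """Propose from the Cauchy(0, 1) prior; accept with probability likelihood / M."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        mu = math.tan(math.pi * (rng.random() - 0.5))  # inverse-CDF Cauchy draw
        if rng.random() < likelihood(mu) / M:
            accepted.append(mu)
    return accepted


draws = 50000
samples = rejection_sample(draws)
post_mean = sum(samples) / len(samples)
post_var = sum((s - post_mean) ** 2 for s in samples) / (len(samples) - 1)

# The acceptance probability equals p(y | M_c) / M, so rescaling the
# observed acceptance rate by M estimates the normalizing constant.
norm_const = M * len(samples) / draws
print(post_mean, post_var, norm_const)
```

Using the prior as the proposal is convenient because the acceptance rate directly estimates the normalizing constant, at the cost of many rejected draws when the prior is much more diffuse than the likelihood.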

Table 3.5 Genetic data from an experiment carried out by Mendel that concerned the numbers of peas that were classified by their shape and color

Round yellow   Wrinkled yellow   Round green   Wrinkled green   Total
n1             n2                n3            n4               n+
315            101               108           32               556

3.9 The data in Table 3.5 result from one of the famous experiments carried out by Mendel in which pure bred peas with round yellow seeds were crossed with pure bred peas with wrinkled green seeds. These data are given on page 15 of the English translation (Mendel 1901) of Mendel (1866). All of the first-generation hybrids had round yellow seeds (since this characteristic is dominant), but when these plants were self-pollinated, four different phenotypes (characteristics) were observed and are displayed in Table 3.5.

A model for these data is provided by the multinomial $M_4(n_+, p)$ where $p = [p_1, p_2, p_3, p_4]^T$ and $p_j$ denotes the probability of falling in cell $j$, $j = 1, \ldots, 4$, that is,

$$\Pr(N = n \mid p) = \frac{n_+!}{\prod_{j=1}^4 n_j!} \prod_{j=1}^4 p_j^{n_j},$$

where $N = [N_1, \ldots, N_4]^T$ and $n = [n_1, \ldots, n_4]^T$. In this exercise a Bayesian analysis of these data will be carried out using the conjugate Dirichlet prior distribution, $\text{Dir}(a_1, a_2, a_3, a_4)$:

$$p(p) = \frac{\Gamma\left(\sum_{j=1}^4 a_j\right)}{\prod_{j=1}^4 \Gamma(a_j)} \prod_{j=1}^4 p_j^{a_j - 1},$$

where $a_j > 0$, $j = 1, \ldots, 4$, are specified a priori.

(a) Show that the marginal prior distributions for $p_j$ are the beta distributions $\text{Be}(a_j, a - a_j)$, where $a = \sum_{j=1}^4 a_j$.

(b) Obtain the distributional form, and the associated parameters, of the posterior distribution $p(p \mid n)$.

(c) For the genetic data and under a prior for $p$ that is uniform over the simplex (i.e., $a_1 = a_2 = a_3 = a_4 = 1$), evaluate $E[p_j \mid n]$ and $\text{s.d.}(p_j \mid n)$, $j = 1, \ldots, 4$.

(d) Obtain histogram representations and 90% credible intervals for $p_j \mid n$, $j = 1, \ldots, 4$.

(e) Determine the form of the predictive distribution for $[N_1, N_2, N_3, N_4]$ given $n_+ = \sum_j n_j$. Describe how a sample from this predictive distribution could be obtained.

A particular model of interest is that which states that genes are inherited independently of each other, so that the ratio of counts is 9:3:3:1, or

$$H_0: p_{10} = \frac{9}{16}, \quad p_{20} = \frac{3}{16}, \quad p_{30} = \frac{3}{16}, \quad p_{40} = \frac{1}{16}.$$

The evidence in favor of this model, versus the alternative of $H_1$: $p$ unspecified, will now be determined.

(f) For the data in Table 3.5, carry out a likelihood ratio test comparing $H_0$ and $H_1$.

(g) Obtain analytical expressions for $\Pr(n \mid H_0)$ and $\Pr(n \mid H_1)$.

(h) Evaluate the Bayes factor $\Pr(n \mid H_0)/\Pr(n \mid H_1)$ for the genetic data. Comment on the evidence for/against $H_0$ and compare with the conclusion from the likelihood ratio test statistic.
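Parts (f) to (h) of Exercise 3.9 can be explored with the sketch below. It assumes the uniform Dirichlet prior of part (c) under $H_1$, for which the marginal probability of the counts is the Dirichlet-multinomial, and it works on the log scale with `math.lgamma` to avoid overflow. Function names are hypothetical and the code is an illustration, not the book's solution.

```python
import math

n = [315, 101, 108, 32]              # Table 3.5 counts
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]  # cell probabilities under H0
a = [1.0, 1.0, 1.0, 1.0]             # uniform Dirichlet prior under H1


def log_multinomial_coef(counts):
    """log of n_+! / prod_j n_j!."""
    return math.lgamma(sum(counts) + 1) - sum(math.lgamma(k + 1) for k in counts)


def log_marginal_h0(counts, probs):
    """log Pr(n | H0): multinomial probability at the fixed p0."""
    return log_multinomial_coef(counts) + sum(
        k * math.log(p) for k, p in zip(counts, probs)
    )


def log_marginal_h1(counts, alpha):
    """log Pr(n | H1): multinomial likelihood averaged over the Dirichlet prior."""
    return (
        log_multinomial_coef(counts)
        + math.lgamma(sum(alpha))
        - sum(math.lgamma(x) for x in alpha)
        + sum(math.lgamma(x + k) for x, k in zip(alpha, counts))
        - math.lgamma(sum(alpha) + sum(counts))
    )


def lrt_statistic(counts, probs):
    """2 * sum_j n_j log(n_j / (n_+ p_j0)), compared with chi-squared on 3 d.f."""
    total = sum(counts)
    return 2 * sum(k * math.log(k / (total * p)) for k, p in zip(counts, probs))


log_bf = log_marginal_h0(n, p0) - log_marginal_h1(n, a)
print("likelihood ratio statistic:", lrt_statistic(n, p0))
print("log Bayes factor in favor of H0:", log_bf)
```

A positive log Bayes factor favors $H_0$; the likelihood ratio statistic is referred to a chi-squared distribution with three degrees of freedom because $H_1$ has three free probabilities and $H_0$ has none.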

3.10 With respect to Sect. 3.11, consider the “likelihood,” $\hat{\theta} \mid \theta \sim N(\theta, V)$ and the prior $\theta \sim N(0, W)$. Show that $\theta \mid \hat{\theta} \sim N(r\hat{\theta}, rV)$ where $r = W/(V + W)$.

3.11  Again  consider  the  situation  discussed  in  Sect.  3.11  in  which  a  Bayesian 
analysis  is  carried  out  based  not  on  the  full  data  but  rather  on  summary 
statistics. 


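(a) Suppose data are to be combined from two studies with a common underlying parameter $\theta$. The estimates from the two studies are $\hat{\theta}_1, \hat{\theta}_2$ with standard errors $\sqrt{V_1}$ and $\sqrt{V_2}$ (with the two estimators being conditionally independent given $\theta$). Show that the Bayes factor that summarizes the evidence from the two studies, that is,

$$\frac{p(\hat{\theta}_1, \hat{\theta}_2 \mid H_0)}{p(\hat{\theta}_1, \hat{\theta}_2 \mid H_1)},$$

takes the form

$$\text{BF}(\hat{\theta}_1, \hat{\theta}_2) = \sqrt{\frac{W}{R V_1 V_2}} \exp\left\{-\frac{1}{2}\left(Z_1^2 R V_2 + 2 Z_1 Z_2 R \sqrt{V_1 V_2} + Z_2^2 R V_1\right)\right\},$$

where $R = W/(V_1 W + V_2 W + V_1 V_2)$ and $Z_1 = \hat{\theta}_1/\sqrt{V_1}$ and $Z_2 = \hat{\theta}_2/\sqrt{V_2}$ are the usual Z-statistics.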

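(b) Suppose now there are $K$ studies with estimates $\hat{\theta}_k$ and asymptotic variances $V_k$, $k = 1, \ldots, K$, and again assume a common underlying parameter $\theta$. Show that the Bayes factor

$$\frac{p(\hat{\theta}_1, \ldots, \hat{\theta}_K \mid H_0)}{p(\hat{\theta}_1, \ldots, \hat{\theta}_K \mid H_1)}$$

takes the form

$$\begin{aligned}
\text{BF}(\hat{\theta}_1, \ldots, \hat{\theta}_K) &= \frac{\prod_{k=1}^K (2\pi V_k)^{-1/2} \exp\left(-\frac{\hat{\theta}_k^2}{2V_k}\right)}{\int \prod_{k=1}^K (2\pi V_k)^{-1/2} \exp\left(-\frac{(\hat{\theta}_k - \theta)^2}{2V_k}\right) (2\pi W)^{-1/2} \exp\left(-\frac{\theta^2}{2W}\right) d\theta} \\
&= \sqrt{W\left[W^{-1} + \sum_{k=1}^K V_k^{-1}\right]} \exp\left\{-\frac{1}{2}\left(\sum_{k=1}^K \frac{\hat{\theta}_k}{V_k}\right)^2 \left[W^{-1} + \sum_{k=1}^K V_k^{-1}\right]^{-1}\right\}.
\end{aligned}$$

Further, show that the posterior summarizing beliefs about $\theta$ given the $K$ estimates is

$$\theta \mid \hat{\theta}_1, \ldots, \hat{\theta}_K \sim N(\mu, \sigma^2)$$

where

$$\mu = \left(\sum_{k=1}^K \frac{\hat{\theta}_k}{V_k}\right) \left[W^{-1} + \sum_{k=1}^K V_k^{-1}\right]^{-1}$$

and

$$\sigma^2 = \left[W^{-1} + \sum_{k=1}^K V_k^{-1}\right]^{-1}.$$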


Chapter  4 

Hypothesis  Testing  and  Variable  Selection 


4.1  Introduction 

In  Sects.  2.9  and  3.10,  we  briefly  described  the  frequentist  and  Bayesian  machinery 
for  carrying  out  hypothesis  testing.  In  this  chapter  we  extend  this  discussion, 
with  an  emphasis  on  critiquing  the  various  approaches  and  on  hypothesis  testing 
in  a  regression  setting.  We  examine  both  single  and  multiple  hypothesis  testing 
situations;  Sects.  4.2  and  4.3  consider  the  frequentist  and  Bayesian  approaches, 
respectively. Section 4.4 describes the well-known Jeffreys-Lindley paradox that highlights the starkly different conclusions that can occur when frequentist and Bayesian hypothesis testing is carried out. This is in contrast to estimation, in which conclusions are often in agreement. In Sects. 4.5-4.7, various aspects of multiple
testing  are  considered.  The  discussion  includes  situations  in  which  the  number  of 
tests  is  known  a  priori  and  variable  selection  procedures  in  which  the  number 
of  tests  is  driven  by  the  data.  Section  4.9  provides  a  discussion  of  the  impact 
on  inference  that  the  careless  use  of  variable  selection  can  have.  Section  4.10 
describes  a  pragmatic  approach  to  variable  selection.  Concluding  remarks  appear 
in  Section  4.11. 


4.2  Frequentist  Hypothesis  Testing 

Early in this chapter we will consider a univariate parameter $\theta$. Suppose we are interested in evaluating the evidence in the data with respect to the null hypothesis:

$$H_0: \theta = \theta_0$$

using a statistic $T$. By convention, large values are less likely under the null. The observed value of the test statistic is $t_{obs}$. As discussed in Sect. 2.9, there are various possibilities for $T$ including squared Wald, likelihood ratio, and score statistics. Under regularity conditions, $T \rightarrow_d \chi^2_1$ under the null, as $n \rightarrow \infty$.
If  n  is  not  large,  or  regularity  conditions  are  violated,  permutation  or  Monte  Carlo 
tests  (perhaps  based  on  bootstrap  samples,  as  described  in  Sect.  2.7)  can  often  be 
performed  to  derive  the  empirical  distribution  of  the  test  statistic  under  the  null. 
A  type  I  error  is  said  to  occur  when  we  reject  H0  when  it  is  in  fact  true,  while  a  type 
II  error  is  to  not  reject  H0  when  it  is  false. 


4.2.1  Fisherian  Approach 

Under the null, for continuous sample spaces, the tail-area probability $\Pr(T > t \mid H_0)$ is uniform. This is not true for discrete sample spaces, but in the following, unless stated otherwise, we will assume we are in situations in which uniformity holds. Let

$$p = \Pr(T > t_{obs} \mid H_0)$$

denote the observed p-value, the probability of observing $t_{obs}$, or a more extreme value, with repeated sampling under the null.
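The uniformity of the p-value under the null can be checked by simulation. The sketch below makes several assumptions that are not in the text: the data are $n$ independent $N(\theta_0, 1)$ observations, $T$ is the squared Wald statistic $n(\bar{y} - \theta_0)^2$ with known unit variance, and the $\chi^2_1$ tail area is computed through the identity $\Pr(\chi^2_1 > t) = \text{erfc}(\sqrt{t/2})$.

```python
import math
import random


def p_value(t_obs):
    """Tail area Pr(chi-squared_1 > t_obs), via the complementary error function."""
    return math.erfc(math.sqrt(t_obs / 2.0))


def simulate_p_values(n=50, reps=5000, theta0=0.0, seed=1):
    """p-values from repeated sampling under H0: theta = theta0."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        ybar = sum(rng.gauss(theta0, 1.0) for _ in range(n)) / n
        t_obs = n * (ybar - theta0) ** 2  # squared Wald statistic, unit variance
        out.append(p_value(t_obs))
    return out


p_values = simulate_p_values()
for alpha in (0.01, 0.05, 0.10, 0.50):
    frac = sum(p < alpha for p in p_values) / len(p_values)
    print(alpha, round(frac, 3))  # each fraction should be close to alpha
```

If the p-value is uniform under the null, the fraction of simulated p-values below any threshold should be close to that threshold, up to Monte Carlo error.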

Fisher advocated the pure test of significance, in which the observed p-value is reported as the measure of evidence against the null (Fisher 1925a), with $H_0$ being
rejected  if  p  is  small.  Alternative  hypotheses  are  not  explicitly  considered  and  so 
there  is  no  concept  of  rejecting  the  null  in  favor  of  a  specific  alternative;  ideally,  the 
test  statistic  will  be  chosen  to  have  high  power  under  plausible  alternatives,  however. 


4.2.2 Neyman-Pearson Approach

In contrast to the procedure of Fisher, the Neyman-Pearson approach is to specify an alternative hypothesis, $H_1$, with $H_0$ nested in $H_1$. The celebrated Neyman-Pearson lemma of Neyman and Pearson (1933) established that, for fixed type I error

$$\alpha = \Pr(T > t_{fix} \mid H_0),$$

the most powerful procedure is provided by the likelihood ratio test (Sect. 2.9.5). The decision rule is to reject the null if $p < \alpha$. Due to the fixed threshold, this procedure controls the type I error at $\alpha$.


4.2.3  Critique  of  the  Fisherian  Approach 

A common explanation for seeing a “small” p-value is that either $H_0$ is not true or $H_0$ is true and we have been “unlucky.” A major practical difficulty is in defining “small.” Put another way, how do we decide on a threshold for significance? The p-value is uniform under the null, but with a large sample size, we will be able to detect very subtle departures from the null and so will often obtain small p-values because the null is rarely “true.” To rectify this a confidence interval for $\theta$ is often reported, along with the p-value, so that the scientific significance of the departure of $\theta$ from $\theta_0$ can be determined. The ability to detect smaller and smaller differences
from  the  null  with  increasing  sample  size  suggests  that  the  p- value  threshold  rule 
used  in  practice  should  decrease  with  increasing  n,  but  there  are  no  universally 
recognized  rules.  In  a  hypothesis  testing  context  a  natural  definition  of  consistency 
is  that  the  rule  for  rejection  is  such  that  the  probability  of  the  correct  decision  being 
made tends to 1 as the sample size increases. So the current use of p-values, in which typically 0.05 or 0.01 is used as a threshold for rejection, regardless of sample size, is inconsistent; by construction, the probability of rejecting the null when it is
true  does  not  decrease  to  zero  with  increasing  sample  size.  By  contrast,  the  type  II 
error  will  typically  decrease  to  zero  with  increasing  sample  size.  A  more  balanced 
approach  than  placing  special  emphasis  on  the  type  I  error  would  be  to  have  both 
type  I  and  type  II  errors  decrease  to  zero  as  n  increases. 

There are two common misinterpretations of p-values. The most basic is to interpret a p-value as the probability of the null given the data, which is a serious misconception. Probabilities of the truth of hypotheses are only possible under a Bayesian approach. More subtly, using the observed value of the test statistic $t_{obs}$ does not allow one to say that following the general procedure will result in control of the type I error at $p$, because the threshold is data-dependent and not fixed. The key observation is that the p-value is associated with “observing $t_{obs}$, or a more extreme value,” so that the tail area begins at the observed value of the statistic. For example, if $p = 0.013$, we cannot say that the procedure controls the type I error at 1.3%. Such control of the type I error is provided by a fixed $\alpha$ level procedure which is based on a fixed threshold, with $\alpha = \Pr(T > t_{fix} \mid H_0)$.

There  is  some  merit  in  the  consideration  of  a  tail  area  when  one  wishes  to 
control  the  type  I  error  rate,  but  when  no  such  control  is  sought,  the  use  of  a  tail 
area seems simply a matter of mathematical convenience. As an alternative the ordinate $p(T = t_{obs} \mid H_0)$ may be considered, which brings one closer to a Bayesian
formulation  (see  Sect.  4.3.1),  but  from  a  frequentist  perspective,  it  is  not  clear  how 
to  scale  the  observed  statistic  without  an  alternative  hypothesis. 


4.2.4  Critique  of  the  Neyman-Pearson  Approach 

As with the use of p-values we need to decide on a size $\alpha$ for the test. The historical emphasis has been on fixing $\alpha$ and then evaluating power, but as with a threshold for p-values, practical guidance on how $\alpha$ should depend on sample size is important but lacking. With an $\alpha$ level that does not change with sample size, one is implicitly accepting that type II errors become more important with increasing sample size, and in a manner which is implied rather than chosen by the investigator. Pearson (1953, p. 68) expressed the desirability of a decreasing $\alpha$ as sample size increases: “... the quite legitimate device of reducing $\alpha$ as $n$ increases.” As we have already noted, a fixed significance level with respect to $n$ gives an inconsistent procedure.

By merely stating that $p < \alpha$, information is lost, but if we state an observed p-value, then we lose control of the type I error because control requires a fixed binary decision rule. The procedure must also be viewed in the light of both $H_0$ and $H_1$ being “wrong” since no model is a correct specification of the data-generating process.

For  discrete  data,  the  discreteness  of  the  statistic  causes  difficulties,  particularly 
for small sample sizes. To achieve exact level $\alpha$ tests, so-called randomization
rules  have  been  suggested.  Under  such  rules,  the  same  set  of  data  may  give 
different  conclusions  depending  on  the  result  of  the  randomization,  which  is  clearly 
undesirable. 


4.3  Bayesian  Hypothesis  Testing  with  Bayes  Factors 


4.3.1  Overview  of  Approaches 


In  the  Bayesian  approach,  all  unknowns  in  a  model  are  treated  as  random  variables, 
even  though  they  relate  to  quantities  that  are  in  reality  fixed.  Therefore,  the  “true” 
hypothesis  is  viewed  as  an  unknown  parameter  for  which  the  posterior  is  derived, 
once  the  alternatives  have  been  specified.  The  latter  step  is  essential  since  we  require 
a  sample  space  of  hypotheses.  In  the  case  of  two  hypotheses,  we  have  the  following 
candidate  data-generating  mechanisms: 

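$$H_0 \Rightarrow \beta_0 \mid H_0 \Rightarrow y \mid \beta_0,$$

$$H_1 \Rightarrow \beta_1 \mid H_1 \Rightarrow y \mid \beta_1.$$

The posterior probability of $H_j$ is, via Bayes theorem,

$$\Pr(H_j \mid y) = \frac{p(y \mid H_j) \times \pi_j}{p(y)},$$

with $\pi_j$ the prior probability of hypothesis $H_j$, $j = 0, 1$. The likelihood of the data is

$$p(y \mid H_j) = \int p(y \mid \beta_j)\, p(\beta_j \mid H_j)\, d\beta_j \qquad (4.1)$$

with $p(\beta_j \mid H_j)$ the prior distribution over the parameters associated with hypothesis $H_j$, $j = 0, 1$, and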


$$p(y) = p(y \mid H_0) \times \pi_0 + p(y \mid H_1) \times \pi_1.$$

The posterior odds in favor of $H_0$ is therefore

$$\text{Posterior Odds} = \frac{\Pr(H_0 \mid y)}{\Pr(H_1 \mid y)} = \text{Bayes factor} \times \text{Prior Odds} \qquad (4.2)$$

where the

$$\text{Bayes factor} = \frac{p(y \mid H_0)}{p(y \mid H_1)}, \qquad (4.3)$$

and the prior odds are $\pi_0/\pi_1$ with $\pi_1 = 1 - \pi_0$. The Bayes factor is the ratio of the density of the data under the null to the density under the alternative and is an intuitively appealing summary of the information the data provide concerning the hypotheses. The Bayes factor was discussed previously in Sect. 3.10. From (4.2), we also see that

$$\text{Bayes Factor} = \frac{\text{Posterior Odds}}{\text{Prior Odds}},$$

which  emphasizes  that  the  Bayes  factor  summarizes  the  information  in  the  data  and 
does  not  involve  the  prior  beliefs  about  the  hypotheses.  As  can  be  seen  in  (4.1), 
priors  on  the  parameters  are  involved  in  each  of  the  numerator  and  denominator  of 
the  Bayes  factor,  since  these  provide  the  distributions  over  which  the  likelihoods  are 
averaged. 
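The roles of (4.1) to (4.3) can be illustrated with a small numerical sketch. The setting is an assumption chosen for simplicity and is not part of this section: a single estimate $\hat{\theta} \mid \theta \sim N(\theta, V)$, with $H_0: \theta = 0$ and, under $H_1$, the prior $\theta \sim N(0, W)$. The integral in (4.1) is approximated by the trapezoidal rule and can be compared with the known closed form $N(0, V + W)$. All names and numerical values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def marginal_h1(theta_hat, V, W, grid=20001, width=12.0):
    """Trapezoidal approximation to (4.1) under H1:
    the integral of p(theta_hat | theta) p(theta | H1) over theta."""
    lo = min(-width * math.sqrt(W), theta_hat - width * math.sqrt(V))
    hi = max(width * math.sqrt(W), theta_hat + width * math.sqrt(V))
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        theta = lo + i * h
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * normal_pdf(theta_hat, theta, V) * normal_pdf(theta, 0.0, W)
    return total * h


def bayes_factor(theta_hat, V, W):
    """(4.3): density of the data under H0 divided by that under H1."""
    return normal_pdf(theta_hat, 0.0, V) / marginal_h1(theta_hat, V, W)


def posterior_prob_h0(bf, pi0):
    """(4.2): posterior odds = Bayes factor x prior odds, mapped to a probability."""
    post_odds = bf * pi0 / (1.0 - pi0)
    return post_odds / (1.0 + post_odds)


theta_hat, V, W = 0.5, 0.04, 1.0  # illustrative values only
bf = bayes_factor(theta_hat, V, W)
for pi0 in (0.5, 0.9):
    print(pi0, posterior_prob_h0(bf, pi0))
```

The same Bayes factor yields different posterior probabilities as $\pi_0$ changes, which is the sense in which the Bayes factor isolates the contribution of the data from prior beliefs about the hypotheses.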

When  it  comes  to  reporting/making  decisions,  various  approaches  based  on 
Bayes  factors  are  available  for  different  contexts.  Most  simply,  one  may  just  report 
the  Bayes  factor.  Kass  and  Raftery  (1995),  following  Jeffreys  (1961),  present  a 
guideline  for  the  interpretation  of  Bayes  factors.  For  example,  if  the  negative  log 
base  10  Bayes  factor  lies  between  1  and  2  (so  that  the  data  are  10-100  times 
more  likely  under  the  alternative,  as  compared  to  the  null),  then  there  is  said  to 
be  strong  evidence  against  the  null  hypothesis.  Such  thresholds  may  be  useful  in 
some  situations,  but  in  general  one  would  like  the  guidelines  to  be  context  driven. 
Going  beyond  the  consideration  of  the  Bayes  factor  only,  one  may  include  prior 
probabilities  on  the  null  and  alternative,  to  give  the  posterior  odds  (4.2).  Stating  the 
posterior  probabilities  may  be  sufficient,  but  one  may  wish  to  derive  a  formal  rule 
for deciding upon which of $H_0$ or $H_1$ to report.

Recall from Sect. 3.10 that, under a Bayesian decision theory approach to hypothesis testing, the “decision” $\delta$ is taken that minimizes the posterior expected loss. Following the notation of Table 3.3, the losses associated with type I and type II errors are $L_I$ and $L_{II}$, respectively. Minimization of the posterior expected loss then results in the rule to choose $\delta = 1$ if

$$\frac{\Pr(H_1 \mid y)}{\Pr(H_0 \mid y)} > \frac{L_I}{L_{II}}$$

or equivalently if

$$\Pr(H_1 \mid y) > \frac{1}{1 + L_{II}/L_I}. \qquad (4.4)$$


For example, if a type I error is four times as bad as a type II error, we should report $H_1$ only if $\Pr(H_1 \mid y) > 0.8$. In contrast, if the balance of losses is reversed, and a type II error is four times as costly as a type I error, we report $H_1$ if $\Pr(H_1 \mid y) > 0.2$.
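The threshold in (4.4) is easy to compute directly. The following sketch reproduces the two numerical cases in the text; the function names are hypothetical and the losses are specified only up to their ratio.

```python
def report_threshold(loss_type1, loss_type2):
    """Right-hand side of (4.4): 1 / (1 + L_II / L_I)."""
    return 1.0 / (1.0 + loss_type2 / loss_type1)


def choose_h1(post_prob_h1, loss_type1, loss_type2):
    """Decision delta = 1 (report H1) when Pr(H1 | y) exceeds the threshold."""
    return post_prob_h1 > report_threshold(loss_type1, loss_type2)


# Type I error four times as bad as type II, then the reverse.
print(report_threshold(4.0, 1.0))
print(report_threshold(1.0, 4.0))
print(choose_h1(0.6, 4.0, 1.0), choose_h1(0.6, 1.0, 4.0))
```

With equal losses the threshold is 0.5, so the rule reduces to reporting whichever hypothesis has the larger posterior probability.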

Discreteness  of  the  sample  space  does  not  pose  any  problems  for  a  Bayesian 
analysis,  since  one  need  only  consider  the  data  actually  observed  and  not  other 
hypothetical  realizations. 


4.3.2  Critique  of  the  Bayes  Factor  Approach 

As  always  with  the  Bayesian  approach,  we  need  to  specify  priors  for  all  of  the 
unknowns,  which  here  correspond  to  each  of  the  hypotheses  and  all  parameters 
(including  nuisance  parameters)  that  are  contained  within  the  models  defined  under 
the  two  hypotheses.  It  turns  out  that  placing  improper  priors  upon  the  parameters 
that  are  the  focus  of  the  hypothesis  test  leads  to  anomalous  behavior  of  the  Bayes 
factor.  We  give  an  informal  discussion  of  the  fundamental  difference  between 
estimation  and  hypothesis  testing  with  respect  to  the  choice  of  improper  priors. 
Suppose we have a model that depends on a univariate unknown parameter, $\theta$, with improper prior $p(\theta) = c$, for arbitrary $c > 0$. The posterior, upon which estimation is based, is

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \qquad (4.5)$$

and so the arbitrary constant in the prior cancels between the numerator and denominator. Now suppose we are interested in comparison of the hypotheses $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$ with $\theta \in \mathbb{R}$. The Bayes factor is

$$\frac{p(y \mid H_0)}{p(y \mid H_1)} = \frac{p(y \mid \theta_0)}{\int p(y \mid \theta)\, p(\theta)\, d\theta},$$

so that the denominator of the Bayes factor depends, crucially, upon $c$. Hence, in this setting the Bayes factor with an improper prior on $\theta$ is not well defined.
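To make the dependence on $c$ explicit, substitute $p(\theta) = c$ into both expressions. The shorthand $m(y) = \int p(y \mid \theta)\, d\theta$ is introduced here only for illustration and is assumed to be finite:

$$p(\theta \mid y) = \frac{c\, p(y \mid \theta)}{c\, m(y)} = \frac{p(y \mid \theta)}{m(y)}, \qquad \frac{p(y \mid H_0)}{p(y \mid H_1)} = \frac{p(y \mid \theta_0)}{c\, m(y)}.$$

Replacing $c$ by $2c$ leaves the posterior unchanged but halves the Bayes factor, so any value of the Bayes factor can be obtained by a suitable choice of the arbitrary constant.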

Specifying  prior  distributions  for  all  of  the  parameters  under  each  hypothesis  can 
be  difficult,  but  Sect.  3.11  describes  a  strategy  based  on  test  statistics  which  requires 
a  prior  distribution  for  the  parameter  of  interest  only. 

In  principle,  one  can  compare  non-nested  models  using  a  Bayesian  approach, 
but  in  practice  great  care  must  be  taken  in  specifying  the  priors  under  the  two 
hypotheses,  in  order  to  not  inadvertently  favor  one  hypothesis  over  another.  One 
possibility  is  to  specify  priors  on  functions  of  the  parameters  that  are  meaningful 
under both hypotheses; for an example of this approach, see Sect. 6.16.

As with the Neyman-Pearson approach, all of the calculations have to be
conditioned upon $H_0 \cup H_1$. In a Bayesian context, we need to emphasize that
we are obtaining the posterior probability of the null given one of the null or
alternative is true and under the assumed likelihood and priors. Consequently,
posterior probabilities on hypotheses must be viewed in a relative, rather than an
absolute, sense since the truth will rarely correspond to $H_0$ or $H_1$. Hence, the precise
interpretation is that the posterior probability of $H_0$ is the posterior probability of
$H_0$, given that one of $H_0$ or $H_1$ is true.

If one follows the decision theory route, one must also specify the ratio of losses,
which is usually difficult. In general, Bayes factor calculation requires analytically
intractable integrals over the null and alternative parameter spaces, to give the two
normalizing constants $p(y \mid H_0)$ and $p(y \mid H_1)$. Further, Markov chain Monte
Carlo approaches do not simply supply these normalizing constants. Analytical
approximations exist under certain conditions; see Sect. 3.10.


4.3.3  A  Bayesian  View  of  Frequentist  Hypothesis  Testing 

We consider an artificial situation in which the only available data in a Bayesian
analysis corresponds to knowing that the event $T > t_{\text{fix}}$ has occurred. This means
that the likelihood of the data, $\Pr(\text{data} \mid H_0)$, coincides with the $\alpha$ level. To obtain
$\Pr(H_0 \mid \text{data})$ we must specify the alternative hypothesis. We consider the simple
case in which the model contains a single parameter $\theta$ with null $H_0: \theta = \theta_0$ and
alternative $H_1: \theta = \theta_1$. Then

$$\Pr(H_0 \mid \text{data}) = \frac{\Pr(\text{data} \mid H_0) \times \pi_0}{\Pr(\text{data} \mid H_0) \times \pi_0 + \Pr(\text{data} \mid H_1) \times \pi_1} \tag{4.6}$$

where $\pi_j = \Pr(H_j)$, $j = 0, 1$. Dividing by $\Pr(H_1 \mid \text{data})$ gives

$$\text{Posterior Odds} = \frac{\Pr(\text{data} \mid H_0)}{\Pr(\text{data} \mid H_1)} \times \text{Prior Odds} = \frac{\alpha}{\text{power at } \theta_1} \times \text{Prior Odds} \tag{4.7}$$

which depends, in addition to the $\alpha$ level, on the prior on $H_0$, $\pi_0$, and on the power,
$\Pr(\text{data} \mid H_1)$. Equation (4.7) implies that, for two studies that report a result as
significant at the same $\alpha$ level, the one with the greater power will, in a Bayesian
formulation, provide greater evidence against the null. The power is never explicitly
considered when reporting under the Fisherian or Neyman-Pearson approaches.
An important conclusion is that to make statements about the “evidence” that the
data contain with respect to a hypothesis, as summarized in an $\alpha$ level, one would
want to know the power or, as a minimum, the sample size (since this is an important
component of the power).
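
As a worked illustration of (4.6) and (4.7), the following sketch computes the posterior probability and posterior odds of the null when the only information is that the test was significant. The function names, and the power values used in the note afterwards, are hypothetical and chosen only for illustration.

```python
def posterior_prob_null(alpha, power, pi0):
    """Pr(H0 | data) when the data are only 'T > t_fix', as in (4.6)."""
    pi1 = 1.0 - pi0
    return alpha * pi0 / (alpha * pi0 + power * pi1)

def posterior_odds_null(alpha, power, pi0):
    """Posterior odds of H0, as in (4.7): (alpha / power) x prior odds."""
    return (alpha / power) * pi0 / (1.0 - pi0)
```

For example, with $\alpha = 0.05$ and $\pi_0 = 0.5$, a study with power 0.8 gives `posterior_prob_null(0.05, 0.8, 0.5)` of about 0.06, while a study with power 0.2 gives 0.2, so the same significance level carries less evidence against the null in the low-powered study.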

The prior is also important, which seems, as already noted, reasonable when one
considers the usual interpretation of a tail area in terms of “either $H_0$ is true and we
were unlucky or $H_0$ is not true.” A prior on $H_0$ is very useful in weighing these two
possibilities. A key observation is that although a particular dataset may be unlikely
under the null, it may also be unlikely under chosen alternatives, so that there may
be insufficient evidence to reject the null, at least in comparison to these alternatives.

Fig. 4.1 Lower bound for $\Pr(H_0 \mid \text{data})$, under three prior specifications, as a
function of the p-value

Sellke et al. (2001) summarize a number of different arguments that lead to the
following, quite remarkable, result. For a p-value $p < e^{-1} = 0.368$:

$$\Pr(H_0 \mid \text{data}) \geq \left\{ 1 - \frac{1}{e\, p \log p} \times \frac{\pi_1}{\pi_0} \right\}^{-1}. \tag{4.8}$$


Hence, given a p-value, one may calculate a lower bound on the posterior probability
of the null. Figure 4.1 illustrates this lower bound, as a function of the p-value,
for three different prior probabilities, $\pi_0$. We see, for example, that with
a p-value of 0.05 and a prior probability on the null of $\pi_0 = 0.75$, we obtain
$\Pr(H_0 \mid \text{data}) \geq 0.55$.

The discussion of Sect. 4.2.3, combined with the implications of (4.7) and (4.8),
might prompt one to ask why p-values are still in use today, in particular with
the almost ubiquitous application of a 0.05 or 0.01 decision threshold. With these
thresholds, which are often required for the publication of results, the relationship
(4.8), with $\pi_0 = 0.5$, gives $\Pr(H_0 \mid \text{data}) \geq 0.29$ and 0.11 with $p = 0.05$ and
0.01, respectively. Rejection of $H_0$ with such probabilities may not be unreasonable
in some circumstances but the difference between the p-value and $\Pr(H_0 \mid \text{data})$ is
apparent.
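
The lower bound (4.8) is simple to evaluate directly. The sketch below is one way to do so; the function name is hypothetical, and the bound is written in the equivalent form $\{1 + (\pi_1/\pi_0)/(-e\,p \log p)\}^{-1}$.

```python
import math

def sellke_lower_bound(p, pi0):
    """Lower bound on Pr(H0 | data) from (4.8), valid for 0 < p < 1/e."""
    if not 0.0 < p < math.exp(-1.0):
        raise ValueError("the bound requires 0 < p < exp(-1)")
    bf_bound = -math.e * p * math.log(p)   # lower bound on the Bayes factor
    alt_prior_odds = (1.0 - pi0) / pi0     # pi_1 / pi_0
    return 1.0 / (1.0 + alt_prior_odds / bf_bound)
```

With these definitions, `sellke_lower_bound(0.05, 0.5)`, `sellke_lower_bound(0.01, 0.5)` and `sellke_lower_bound(0.05, 0.75)` reproduce, to two decimal places, the values 0.29, 0.11 and 0.55 quoted in the text.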

Small prior probabilities, $\pi_0$, were not historically the norm since, particularly in
experimental  situations,  data  would  not  be  collected  if  there  were  little  chance  the 
alternative  were  true. 

In  some  disciplines  scientists  may  calibrate  p-values  to  the  sample  sizes  with 
which  they  are  familiar,  as  no  doubt  Fisher  did  when  the  0.05  rule  emerged. 
For  example,  in  Tables  29  and  30  of  Statistical  Methods  for  Research  Workers 
(Fisher  1990),  the  sample  sizes  were  30  and  17,  and  Fisher  discusses  the  0.05  limit 
in  each  case,  though  in  both  cases  he  concentrates  more  on  the  context  than  on  the 
absolute  value  of  0.05. 


Poor calibration of p-values could be one of the reasons why so many “findings”
are  not  reproducible,  along  with  the  other  usual  suspects  of  confounding,  data 
dredging,  multiple  testing,  and  poorly  measured  covariates. 


4.4  The  Jeffreys-Lindley  Paradox 

We  now  discuss  a  famous  example  in  which  Bayesian  and  frequentist  approaches 
to  hypothesis  testing  give  starkly  different  conclusions.  The  example  has  been 
considered  by  many  authors,  but  Lindley  (1957)  and  Jeffreys  (1961)  provide  early 
discussions; see also Bartlett (1957). To illustrate the so-called Jeffreys-Lindley
“paradox,” we assume that $Y_n \mid \theta \sim N(\theta, \sigma^2/n)$ with $\sigma^2$ known and $\theta$ unknown.
Suppose the null is $H_0: \theta = 0$, with alternative $H_1: \theta \neq 0$. Let

$$y_n = z_{1-\alpha/2} \times \sigma/\sqrt{n}$$

where $\alpha$ is the level of the test and $\Pr(Z < z_{1-\alpha/2}) = 1 - \alpha/2$, with $Z \sim N(0, 1)$.
We define $y_n$ in this manner, so that for different values of $n$ the $\alpha$ level remains
constant. For a Bayesian analysis, assume $\pi_0 = \Pr(H_0)$, and under the alternative
$\theta \sim N(0, \tau^2)$. In the early discussions of the paradox, a uniform prior over a finite
range was assumed, but the message of the paradox is unchanged with the use of a
normal prior. Then

$$\Pr(H_0 \mid y_n) = \frac{\text{Bayes Factor} \times \text{Prior Odds}}{1 + \text{Bayes Factor} \times \text{Prior Odds}}$$

where the Bayes factor is

$$\text{Bayes Factor} = \frac{p(y_n \mid H_0)}{p(y_n \mid H_1)} \tag{4.9}$$

and the Prior Odds $= \pi_0/(1 - \pi_0)$. The prior predictive distributions, the ratio of
whose densities gives the Bayes factor (4.9), are

$$y_n \mid H_0 \sim N(0, \sigma^2/n) \tag{4.10}$$

$$y_n \mid H_1 \sim N(0, \sigma^2/n + \tau^2). \tag{4.11}$$

Figure 4.2 shows these two densities, as a function of $y_n$, for $\sigma^2 = 1$, $\tau^2 = 0.2^2$, and
$n = 100$. An $\alpha$ level of 0.05 gives $y_n = 1.96 \times \sigma/\sqrt{n} = 0.20$, the value indicated in
the figure with a dashed-dotted vertical line. For this value, the Bayes factor equals
0.48, so that the data are roughly twice as likely under the alternative as compared
to the null. The Sellke et al. (2001) bound on the Bayes factor is $\text{BF} \geq -e\,p \log p$,
which for $p = 0.05$ gives $\text{BF} \geq 0.41$.

Fig. 4.2 Numerator (solid line) and denominator (dashed line) of the Bayes
factor for $n = 100$. The model is $Y_n \mid \theta \sim N(\theta, \sigma^2/n)$ with $\sigma^2 = 1$.
The null and alternative are $H_0: \theta = 0$ and $H_1: \theta \neq 0$, and the prior under the
alternative is $\theta \sim N(0, \tau^2)$ with $\tau^2 = 0.2^2$. The dashed-dotted vertical line
corresponds to $y_n = 0.20$ which for this $n$ gives $\alpha = 0.05$


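The Bayes factor is the ratio of (4.10) and (4.11):

$$\text{Bayes Factor} = \frac{(2\pi\sigma^2/n)^{-1/2} \exp\left[-\dfrac{y_n^2}{2\sigma^2/n}\right]}{(2\pi[\sigma^2/n + \tau^2])^{-1/2} \exp\left[-\dfrac{y_n^2}{2(\sigma^2/n + \tau^2)}\right]} = \sqrt{\frac{\sigma^2/n + \tau^2}{\sigma^2/n}}\; \exp\left[-\frac{z_{1-\alpha/2}^2}{2}\, \frac{\tau^2}{\tau^2 + \sigma^2/n}\right]. \tag{4.12}$$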


This last expression reveals that, as $n \to \infty$, the Bayes factor $\to \infty$, so that
$\Pr(H_0 \mid y_n) \to 1$. Therefore, the “paradox” is that for a level of significance $\alpha$, chosen to
be arbitrarily small, we can find datasets which make the posterior probability of
the null arbitrarily close to 1, for some $n$. Hence, frequentist and Bayes procedures
can, for sufficiently large sample size, come to opposite conclusions with respect to
a hypothesis test.
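
The behavior of (4.12) as $n$ grows can be examined with a few lines of code. The sketch below uses only the Python standard library; the function names are hypothetical, and the default values $\sigma^2 = 1$, $\tau^2 = 0.2^2$, $\alpha = 0.05$ and $\pi_0 = 0.5$ are those used in the figures of this section.

```python
import math
from statistics import NormalDist

def jl_bayes_factor(n, alpha=0.05, sigma2=1.0, tau2=0.2 ** 2):
    """Bayes factor (4.12) evaluated at y_n = z_{1-alpha/2} * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    v0 = sigma2 / n   # prior predictive variance under H0, see (4.10)
    v1 = v0 + tau2    # prior predictive variance under H1, see (4.11)
    return math.sqrt(v1 / v0) * math.exp(-0.5 * z * z * tau2 / v1)

def jl_posterior_prob_null(n, pi0=0.5, **kwargs):
    """Pr(H0 | y_n) obtained from the Bayes factor and the prior odds."""
    odds = jl_bayes_factor(n, **kwargs) * pi0 / (1.0 - pi0)
    return odds / (1.0 + odds)
```

Here `jl_bayes_factor(100)` returns approximately 0.48, the value quoted above, while for very large `n` the Bayes factor grows roughly like $\sqrt{n}$ and `jl_posterior_prob_null(n)` approaches 1 even though each $y_n$ sits exactly on the boundary of the 0.05 rejection region.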

Figure 4.3 plots the posterior probability of the null as a function of $n$ for
$\sigma^2 = 1$, $\tau^2 = 0.2^2$, $\pi_0 = 0.5$, $\alpha = 0.05$. From the starting position of 0.5 (the prior
probability, indicated as a dashed line), the curve $\Pr(H_0 \mid y_n)$ initially falls,
reaching a minimum at around $n = 100$, and then increases towards 1, illustrating
the “paradox.” For large values of $n$, $y_n$ is very close to the null value of 0, but there
is high power to detect any difference from 0, and so an $\alpha$ of 0.05 is not difficult
to achieve. The Bayes factor also incorporates the density under the alternative and
values close to 0 are more likely under the null, as illustrated in Fig. 4.2.

We now consider a Bayesian analysis of the above problem but assume that the
data appear only in the form of knowing that $|Y_n| > y_n$, a censored observation.
This is clearly not the usual situation since a Bayesian would condition on the actual
value observed, but it does help to understand the paradox. The Bayes factor is

$$\frac{\Pr(|Y_n| > y_n \mid H_0)}{\Pr(|Y_n| > y_n \mid H_1)} = \frac{\alpha}{\int \Pr(|Y_n| > y_n \mid \theta)\, p(\theta)\, d\theta},$$

that is, the type I error rate divided by the power averaged over the prior $p(\theta)$.
Figure 4.4a gives the average power as a function of $n$. We see a monotonic increase
with sample size towards the value 1, as we would expect with fixed $\alpha$.

Fig. 4.3 Posterior probability of the null versus sample size, for a fixed $\alpha$
level of 0.05. The model is $Y_n \mid \theta \sim N(\theta, \sigma^2/n)$ with $\sigma^2 = 1$. The null and
alternative are $H_0: \theta = 0$ and $H_1: \theta \neq 0$, and the prior under the alternative is
$\theta \sim N(0, \tau^2)$ with $\tau^2 = 0.2^2$

Fig. 4.4 Bayes factor based on a tail area with null and alternative of $H_0: \theta = 0$ and $H_1: \theta \neq 0$:
(a) Average power, which corresponds to the denominator of the Bayes factor, under a $N(0, 0.2^2)$
prior and for a fixed $\alpha$ level of 0.05 and (b) Bayes factor based on the tail area, with $\alpha = 0.05$; the
horizontal dashed line indicates a tail-area Bayes factor value of 0.05. The horizontal axis in both
panels is $n$

Since the Bayes factor is the ratio of $\alpha$ to the average power, we see in Fig. 4.4b
that the Bayes factor based on the tail-area information is monotonic decreasing
towards $\alpha$ as $n$ increases (and with $\pi_0 = 0.5$, this gives the posterior probability of
the null also). For our present purposes, the calculation with the tail area illustrates
that when a Bayesian analysis conditions on a tail area, the conclusions are in line
with a frequentist analysis.

The difference in behavior between a genuine Bayesian analysis that conditions
on the actual statistic and that based on conditioning on a tail area is apparent.
As noted by Lindley (1957, p. 189-190), “... the paradox arises because the
significance level argument is based on the area under a curve and the Bayesian
argument is based on the ordinate of the curve.”
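
The tail-area Bayes factor can also be computed in closed form under the normal prior, which gives a quick check on the behavior described for Fig. 4.4. The sketch relies on the fact that averaging the $N(\theta, \sigma^2/n)$ sampling distribution over $\theta \sim N(0, \tau^2)$ gives the prior predictive (4.11), so the prior-averaged power is a two-sided tail probability of $N(0, \sigma^2/n + \tau^2)$. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def average_power(n, alpha=0.05, sigma2=1.0, tau2=0.2 ** 2):
    """Pr(|Y_n| > y_n | H1): power averaged over the N(0, tau2) prior."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)
    y_n = z * math.sqrt(sigma2 / n)            # boundary of the rejection region
    marginal_sd = math.sqrt(sigma2 / n + tau2)  # sd of Y_n under H1, see (4.11)
    return 2.0 * (1.0 - std_normal.cdf(y_n / marginal_sd))

def tail_area_bayes_factor(n, alpha=0.05, **kwargs):
    """Type I error rate divided by the prior-averaged power."""
    return alpha / average_power(n, alpha=alpha, **kwargs)
```

As `n` increases, `average_power(n)` rises towards 1 and `tail_area_bayes_factor(n)` falls towards `alpha`, in contrast to the Bayes factor based on the ordinate, which eventually increases without bound.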

Ignoring now the comparison with tests of significance, it is informative to
examine the Bayes factor for fixed $y_n$. Upon rearrangement of (4.12),

$$\text{Bayes Factor} = \sqrt{\frac{\sigma^2/n + \tau^2}{\sigma^2/n}}\; \exp\left[-\frac{y_n^2}{2\sigma^2/n}\, \frac{\tau^2}{\tau^2 + \sigma^2/n}\right].$$

As $\tau^2 \to \infty$, the Bayes factor $\to \infty$ so that $\Pr(H_0 \mid y_n) \to 1$, which is at first
sight counterintuitive since increasing $\tau^2$ places less prior mass close to $\theta = 0$.
However, this behavior occurs because averaging with respect to the prior on $\theta$ with
large $\tau^2$ produces a small $\Pr(y_n \mid H_1)$, because the prior under the alternative is
spreading mass very thinly across a large range; $\tau^2 \gg 0$ suggests very little prior
belief in any $\theta \neq 0$. Hence, even if the data point strongly to a particular $\theta \neq 0$, we
still prefer $H_0$. More generally, $\tau^2 \gg 0$ should not be interpreted as “ignorance”
since it supports very big effects. Said another way, as $\tau^2 \to 0$, the Bayes factor
favors the alternative, even though as $\tau^2$ gets smaller and smaller the prior under the
alternative becomes more and more concentrated about the null.
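
The effect of a very diffuse prior can be seen by evaluating the Bayes factor, that is, the ratio of the densities (4.10) and (4.11), at a fixed $y_n$ for increasing $\tau^2$. The sketch below is illustrative only: the function name is hypothetical, and the values $y_n = 0.5$ and $n = 100$ mentioned afterwards (an observation five standard errors from zero when $\sigma^2 = 1$) are assumptions chosen to represent data that point strongly away from the null.

```python
import math

def bayes_factor_fixed_y(y_n, n, tau2, sigma2=1.0):
    """Ratio of the N(0, sigma2/n) and N(0, sigma2/n + tau2) densities at y_n."""
    v0 = sigma2 / n   # variance under H0, see (4.10)
    v1 = v0 + tau2    # variance under H1, see (4.11)
    log_bf = 0.5 * math.log(v1 / v0) - 0.5 * y_n ** 2 * (1.0 / v0 - 1.0 / v1)
    return math.exp(log_bf)
```

With $y_n = 0.5$ and $n = 100$, the Bayes factor is tiny for $\tau^2 = 1$, but it increases steadily with $\tau^2$ and exceeds 1 once $\tau^2$ is of the order of $10^{12}$, so that a sufficiently diffuse prior under the alternative leads to a preference for $H_0$ whatever the data.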


4.5  Testing  Multiple  Hypotheses:  General  Considerations 

In  the  following  sections  we  examine  how  inference  proceeds  when  more  than  a 
single  hypothesis  test  is  performed.  There  are  many  situations  in  which  multiple 
hypothesis  testing  arises,  but  we  concentrate  on  just  two.  In  the  first,  which  we  refer 
to  as  a  fixed  number  of  tests  scenario,  we  suppose  that  the  number  of  hypotheses 
to  be  tested  is  known  a  priori,  and  is  not  data  driven,  which  makes  the  task  of 
evaluating  the  properties  of  proposed  solutions  (both  frequentist  and  Bayesian)  more 
straightforward.  This  case  is  discussed  in  Sect.  4.6.  As  an  example,  we  will  shortly 
introduce  a  running  example  that  concerns  comparing,  between  two  populations, 
expression  levels  for  m  =  1,000  gene  transcripts  (during  transcription,  a  gene 
is transcribed into (multiple) RNA transcripts). In the second situation, which we
refer  to  as  variable  selection,  and  which  is  discussed  in  Sect.  4.7,  the  number  of 
hypotheses  to  be  tested  is  random,  which  makes  the  evaluation  of  properties  more 
difficult. 

One  of  the  biggest  abuses  of  statistical  techniques  is  the  unprincipled  use  of 
model  selection.  Two  examples  of  this  are  separately  testing  the  significance  of 
a  large  number  of  variables  and  then  reporting  only  those  that  are  nominally 
“significant”  (the  problem  considered  in  Sect.  4.6),  and  testing  multiple  confounders 
to  see  which  ones  to  control  for  (the  problem  considered  in  Sect.  4.7).  In  each 
of  these  cases,  even  if  the  exact  procedure  is  described,  unless  care  is  exercised, 
interpretation is extremely difficult.


4.6 Testing Multiple Hypotheses: Fixed Number of Tests

Suppose  we  wish  to  examine  the  association  between  a  response  and  m  different 
covariates.  In  a  typical  epidemiological  study,  many  potential  risk  factors  are 
measured,  and  an  exploratory,  hypothesis-generating  procedure  may  systematically 
examine  the  association  between  the  outcome  and  each  of  the  risk  factors.  In 
general,  the  covariates  may  not  be  independent,  which  complicates  the  analysis. 
Another fixed number of tests scenario is when $m$ responses are examined with
respect  to  a  single  covariate.  Recently,  there  has  been  intense  interest  in  so-called 
high  throughput  techniques  in  which  thousands,  or  tens  of  thousands,  of  variables 
are  measured,  often  as  a  screening  exercise  in  which  the  aim  is  to  see  which  of 
the  variables  are  associated  with  some  biological  endpoint.  For  example,  one  may 
examine  whether  the  expression  levels  of  many  thousands  of  genes  are  elevated  or 
reduced  in  samples  from  cancer  patients,  as  compared  to  cancer-free  individuals. 

When $m$ tests are performed, the aim is to decide which of the nulls should
be rejected. Table 4.1 shows the possibilities when $m$ tests are performed and $K$ are
flagged as requiring further attention. Here $m_0$ is the number of true nulls, $B$ is the
number of type I errors, and $C$ is the number of type II errors, and each of these
quantities is unknown. The aim is to select a rule on the basis of some criterion and
this in turn will determine $K$. The internal cells of Table 4.1 are random variables,
whose distribution depends on the rule by which $K$ is derived.


Example:  Microarray  Data 

To  illustrate  the  multiple  testing  problem  in  a  two-group  setting,  we  examine  a 
subset  of  microarray  data  presented  by  Kerr  (2009).  The  data  we  analyze  consist 
of  expression  levels  on  m  =  1,000  transcripts  measured  in  Epstein-Barr  virus- 
transformed  lymphoblastic  cell  line  tissue,  in  each  of  two  populations.  Each 
transcript  was  measured  on  60  individuals  of  European  ancestry  (CEU)  and  45 
ethnic  Chinese  living  in  Beijing  (CHB).  The  data  have  been  normalized,  and  log2 
transformed,  so  that  a  one-unit  difference  between  recorded  values  corresponds  to  a 
doubling  of  expression  level. 

Table 4.1 Possibilities when m tests are performed and K are flagged as worthy of
further attention

          Not flagged   Flagged
$H_0$     A             B          $m_0$
$H_1$     C             D          $m_1$
          $m - K$       K          m

Let $Y_{ki}$ be the measured expression level for transcript $i$ in population $k$, with
$i = 1, \dots, m$, and $k = 0/1$ representing the CEU/CHB populations. Then define
$Y_i = Y_{1i} - Y_{0i}$, and let $s_{ki}^2$ be the sample variance in population $k$, for transcript $i$,
$i = 1, \dots, m$. We now assume

$$Y_i \mid \mu_i \sim_{\text{ind}} N(\mu_i, \sigma_i^2)$$

where $\sigma_i^2 = s_{0i}^2/60 + s_{1i}^2/45$ is the sample variance, which is reliably estimated
for the large sample sizes in the two populations and therefore assumed known.
The null hypotheses of interest are that the difference in the average expression
level between the two populations is zero. We let $H_i = 0$ correspond to the null for
transcript $i$, that is, $\mu_i = 0$ for $i = 1, \dots, m$. Figure 4.5a gives a histogram of the
Z scores $Y_i/\sigma_i$, along with the reference $N(0, 1)$ distribution. Clearly, unless there
are problems with the model formulation, there are a large number of transcripts
that are differentially expressed between the two populations, as confirmed by the
histogram of p-values displayed in Fig. 4.5b.

Fig. 4.5 (a) Z scores and (b) p-values, for 1,000 transcripts in the microarray data


4.6.1 Frequentist Analysis

In  a  single  test  situation  we  have  seen  that  the  historical  emphasis  has  been  on 
control of the type I error rate. We let $H_i = 0/1$ represent the hypotheses for the
$i = 1, \dots, m$ tests. In a multiple testing situation there are a variety of criteria that
may be considered. With respect to Table 4.1, the family-wise error rate (FWER)
is the probability of making at least one type I error, that is,
$\Pr(B \geq 1 \mid H_1 = 0, \dots, H_m = 0)$. Intuitively, this is a sensible criterion if one has a strong prior
belief that all (or nearly all) of the null hypotheses are true, since in such a situation
making  at  least  one  type  I  error  should  be  penalized  (this  is  made  more  concrete 
in  Sect.  4.6.2).  In  contrast,  if  one  believes  that  a  number  of  the  nulls  are  likely  to 
be  false,  then  one  would  be  prepared  to  accept  a  greater  number  of  type  I  errors, 
in  exchange  for  discovering  more  true  associations.  As  in  all  hypothesis  testing 
situations,  we  want  a  method  for  trading  off  type  I  and  type  II  errors. 


Table 4.2 True FWER as a function of the correlation $\rho$ between two bivariate normal
test statistics

$\rho$    True FWER
0         0.0497
0.3       0.0484
0.5       0.0465
0.7       0.0430
0.9       0.0362

Let $B_i$ be the event that the $i$th null is incorrectly rejected, so that, with respect
to Table 4.1, the event $B \geq 1$, where $B$ is the random variable representing the number
of incorrectly rejected nulls, corresponds to $\cup_{i=1}^m B_i$. With a common level for each
test $\alpha^*$, the FWER is

$$\begin{aligned}
\alpha_F &= \Pr(B \geq 1 \mid H_1 = 0, \dots, H_m = 0) = \Pr(\cup_{i=1}^m B_i \mid H_1 = 0, \dots, H_m = 0) \\
&\leq \sum_{i=1}^m \Pr(B_i \mid H_1 = 0, \dots, H_m = 0) \\
&= m\alpha^*.
\end{aligned} \tag{4.13}$$

The Bonferroni method takes $\alpha^* = \alpha_F/m$ to give FWER $\leq \alpha_F$. For example, to
control the FWER at a level of $\alpha_F = 0.05$ with $m = 10$ tests, we would take $\alpha^* =
0.05/10 = 0.005$. Since it controls the FWER, the Bonferroni method is stringent
(i.e., conservative in the sense that the bar is set high for rejection) and so can result
in a loss of power in the usual situation in which the FWER is set at a low value,
for example 0.05. A little more conservatism is also introduced via the inequality,
(4.13). The Sidak correction, which we describe shortly, overcomes this aspect.

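If the test statistics are independent,

$$\begin{aligned}
\Pr(B \geq 1) &= 1 - \Pr(B = 0) \\
&= 1 - \Pr(\cap_{i=1}^m B_i') \\
&= 1 - \prod_{i=1}^m \Pr(B_i') \\
&= 1 - (1 - \alpha^*)^m,
\end{aligned}$$

where $B_i'$ denotes the complement of $B_i$. Consequently, to achieve FWER $= \alpha_F$ we may
take $\alpha^* = 1 - (1 - \alpha_F)^{1/m}$, the so-called Sidak correction (Sidak 1967).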
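
The two per-test levels are easily compared numerically. The following sketch, with hypothetical function names, computes the Bonferroni and Sidak levels and the FWER that a given per-test level achieves when the tests are independent.

```python
def bonferroni_level(alpha_f, m):
    """Per-test level alpha* = alpha_F / m."""
    return alpha_f / m

def sidak_level(alpha_f, m):
    """Per-test level alpha* = 1 - (1 - alpha_F)^(1/m)."""
    return 1.0 - (1.0 - alpha_f) ** (1.0 / m)

def fwer_independent(alpha_star, m):
    """FWER 1 - (1 - alpha*)^m achieved under independent test statistics."""
    return 1.0 - (1.0 - alpha_star) ** m
```

For $\alpha_F = 0.05$ and $m = 10$, the Bonferroni level is 0.005 and the Sidak level is slightly larger (about 0.0051); under independence the Sidak level achieves an FWER of exactly 0.05, whereas the Bonferroni level achieves slightly less.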

With dependent tests, the Bonferroni approach is even more conservative; we
demonstrate with $m = 2$ and bivariate normal test statistics with correlation
$\rho$. Suppose we wish to achieve a FWER of 0.05. Table 4.2 gives the FWER
achieved using Bonferroni and illustrates how the test becomes more conservative
as the correlation increases. The situation becomes worse as $m$ increases in size.
The $k$-FWER criterion (Lehmann and Romano 2005) extends FWER to the incorrect
rejection of $k$ or more nulls (Exercise 4.2).

A simple remedy to the conservative nature of the control of FWER is to increase
$\alpha^*$. An intuitive measure to calibrate a procedure is via the expected number of false
discoveries:

$$\text{EFD} = m_0 \times \alpha^* \leq m \times \alpha^*$$

where $\alpha^*$ is the level for each test. If $m_0$ is close to $m$, this inequality will be
practically useful. As an example, one could specify $\alpha^*$ such that the EFD $\leq 1$
(say), by choosing $\alpha^* = 1/m$.

Recently there has been interest in a criterion that is particularly useful in
multiple testing situations. We first define the false discovery proportion (FDP) as
the proportion of incorrect rejections:

$$\text{FDP} = \begin{cases} B/K & \text{if } B > 0 \\ 0 & \text{if } B = 0. \end{cases}$$

Then the false discovery rate (FDR), the expected proportion of rejected nulls that
are actually true, is

$$\text{FDR} = E[\text{FDP}] = E[B/K \mid B > 0] \Pr(B > 0).$$


Consider the following procedure for independent p-values, each of which is
uniform under the null:

1. Let $P_{(1)} < \dots < P_{(m)}$ denote the ordered p-values.

2. Define $l_i = i\alpha/m$ and $R = \max\{i : P_{(i)} \leq l_i\}$ where $\alpha$ is the value at which
we would like FDR control.

3. Define the p-value threshold as $p_T = P_{(R)}$.

4. Reject all hypotheses for which $P_i \leq p_T$, that is, set $H_i = 1$ in such cases,
$i = 1, \dots, m$.

Benjamini and Hochberg (1995) show that if this procedure is applied, then
regardless of how many nulls are true ($m_0$) and regardless of the distribution of
the p-values when the null is false,

$$\text{FDR} \leq \frac{m_0}{m}\alpha \leq \alpha.$$

We say that the FDR is controlled at $\alpha$.
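
The four steps of the procedure translate directly into code. The following is a minimal sketch using only the standard library; the function name and the convention of returning `None` as the threshold when no hypothesis is rejected are choices made here for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (threshold, reject flags) for the step-up procedure above."""
    m = len(p_values)
    # Step 1: indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: R is the largest rank i with P_(i) <= i * alpha / m.
    r = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            r = rank
    if r == 0:
        return None, [False] * m
    # Step 3: the p-value threshold is the R-th ordered p-value.
    threshold = p_values[order[r - 1]]
    # Step 4: reject every hypothesis with a p-value at or below the threshold.
    return threshold, [p <= threshold for p in p_values]
```

Note that the rule is a step-up rule: with the hypothetical p-values `[0.02, 0.021, 0.022, 0.9]` and `alpha = 0.05`, the smallest p-value fails its own comparison ($0.02 > 0.0125$), but the third ordered p-value passes ($0.022 \leq 0.0375$), so all three small p-values are rejected.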


Example:  Hypothetical  Data 


We simulate data from $m = 100$ hypothetical tests in which $m_0 = 95$ tests are
null, to give $m_1 = 5$ tests for which the alternative is true. Figure 4.6 displays
the sorted observed $-\log_{10}$(p-values) versus the expected $-\log_{10}$(p-values), along
with a line of equality (solid line). Also displayed are three approaches to calling
significance. The top dashed line corresponds to a Bonferroni correction at the 5%
level (so that the line is at $-\log_{10}(0.05/100) = 3.30$). This criterion calls a single
test as significant, illustrating the conservative nature of the control of FWER at a
low value. If we choose instead to control the expected number of false discoveries
at 1, then the dotted line at $-\log_{10}(1/100) = 2$ results. We see that all 5 true
alternatives are selected, along with a single false positive. Finally, we examine
those hypotheses that would be rejected if we control the FDR at $\alpha = 0.05$, via
the Benjamini-Hochberg procedure. On the log to the base 10 scale, the potential
thresholds $l_i = i\alpha/m$, $i = 1, \dots, m$ correspond to a line with slope 1 and intercept
$-\log_{10}(\alpha)$. The dotted-dashed line gives the FDR threshold (recall the FDR is
an expectation) corresponding to $\alpha = 0.05$. The use of this threshold gives three
p-values as significant, for an empirical FDR of zero.

Fig. 4.6 Observed versus expected $-\log_{10}$(p-values) for a simulated set of data
with 95 nulls and 5 alternatives. Three criteria for rejection, based on
Bonferroni, the expected number of false discoveries (EFD), and the false
discovery rate (FDR), are included on the plot


□ 
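
The positions of the three reference lines in this example follow from the per-test thresholds alone and do not depend on the simulated data. A small sketch of the arithmetic, with a hypothetical function name:

```python
import math

def neg_log10_thresholds(m, alpha=0.05, efd=1.0):
    """Bonferroni, EFD and Benjamini-Hochberg thresholds on the -log10 scale."""
    bonferroni = -math.log10(alpha / m)   # per-test level alpha / m
    efd_line = -math.log10(efd / m)       # per-test level efd / m
    # Candidate thresholds l_i = i * alpha / m for i = 1, ..., m.
    bh_lines = [-math.log10(i * alpha / m) for i in range(1, m + 1)]
    return bonferroni, efd_line, bh_lines
```

With `m = 100` this gives 3.30 for Bonferroni and 2 for the EFD criterion, as stated in the example. The first Benjamini-Hochberg threshold coincides with the Bonferroni threshold, and the last equals $-\log_{10}(\alpha)$.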

The algorithm of Benjamini and Hochberg (1995) begins with a desired FDR
and then provides the p-value threshold. Storey (2002) proposed an alternative
method by which, for any fixed rejection region, a criterion closely related to FDR,
the positive false discovery rate pFDR $= E[B/K \mid K > 0]$, may be estimated.
We assume rejection regions of the form $T > t_{\text{fix}}$ and consider the pFDR associated
with regions of this form, which we write as pFDR($t_{\text{fix}}$). We define, for $i = 1, \dots, m$
tests, the random variables $H_i = 0/1$ corresponding to null/alternative hypotheses
and test statistics $T_i$. Then, with $\pi_0 = \Pr(H_i = 0)$ and $\pi_1 = 1 - \pi_0$ independently
for all tests,

$$\text{pFDR}(t_{\text{fix}}) = \frac{\Pr(T > t_{\text{fix}} \mid H = 0) \times \pi_0}{\Pr(T > t_{\text{fix}} \mid H = 0) \times \pi_0 + \Pr(T > t_{\text{fix}} \mid H = 1) \times \pi_1}.$$

Note the similarity with (4.6). Consideration of the false discovery odds:

$$\frac{\text{pFDR}(t_{\text{fix}})}{1 - \text{pFDR}(t_{\text{fix}})} = \frac{\Pr(T > t_{\text{fix}} \mid H = 0)}{\Pr(T > t_{\text{fix}} \mid H = 1)} \times \frac{\pi_0}{\pi_1}$$

explicitly shows the weighted trade-off of type I and type II errors, with weights
determined by the prior on the null/alternative; this expression mimics (4.7). Storey
(2003) rigorously shows that

$$\text{pFDR}(t_{\text{fix}}) = \Pr(H = 0 \mid T > t_{\text{fix}}),$$


giving a Bayesian interpretation. In terms of p-values, the rejection region corresponding
to $T > t_{\text{fix}}$ is of the form $[0, \gamma]$. Let $P$ be the random p-value resulting
from a test. Under the null, $P \sim U(0, 1)$, and so

$$\text{pFDR}(t_{\text{fix}}) = \frac{\Pr(P \leq \gamma \mid H = 0) \times \pi_0}{\Pr(P \leq \gamma)} = \frac{\gamma \times \pi_0}{\Pr(P \leq \gamma)}. \tag{4.14}$$


From this expression, the crucial role of $\pi_0$ is evident. Storey (2002) estimates
(4.14), using uniformity of p-values under the null, to produce the estimates

$$\hat{\pi}_0 = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)}, \qquad \widehat{\Pr}(P \leq \gamma) = \frac{\#\{p_i \leq \gamma\}}{m} \tag{4.15}$$

with $\lambda$ chosen via the bootstrap to minimize the mean-squared error for prediction
of the pFDR. The expression (4.15) calculates the empirical proportion of p-values
to the right of $\lambda$ and then inflates this to account for the proportion of null p-values
in $[0, \lambda]$.
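
The plug-in estimates in (4.14) and (4.15) can be sketched as follows. This is a simplified illustration rather than the full method: $\lambda$ is fixed at an assumed value of 0.5 instead of being chosen by the bootstrap, and the function names are hypothetical.

```python
def storey_pi0(p_values, lam=0.5):
    """Estimate of pi_0 in (4.15): #{p_i > lam} / (m * (1 - lam))."""
    m = len(p_values)
    return sum(p > lam for p in p_values) / (m * (1.0 - lam))

def storey_pfdr(p_values, gamma, lam=0.5):
    """Plug-in estimate of (4.14) for the rejection region [0, gamma]."""
    m = len(p_values)
    n_rejected = sum(p <= gamma for p in p_values)
    if n_rejected == 0:
        raise ValueError("no p-values fall in the rejection region [0, gamma]")
    prob_reject = n_rejected / m   # empirical Pr(P <= gamma)
    return gamma * storey_pi0(p_values, lam) / prob_reject
```

For the hypothetical p-values `[0.001, 0.004, 0.01, 0.03, 0.2, 0.35, 0.55, 0.6, 0.8, 0.95]`, four values exceed 0.5, giving $\hat{\pi}_0 = 4/(10 \times 0.5) = 0.8$, and four values lie in $[0, 0.05]$, giving an estimated pFDR of $0.05 \times 0.8/0.4 = 0.1$.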

This method highlights the benefits of using the totality of p-values to estimate
fundamental quantities of interest such as $\pi_0$. In general, information in all of the
data  may  also  be  exploited,  and  in  Sect.  4.6.2,  we  describe  a  Bayesian  mixture  model 
that  uses  the  totality  of  data. 

The q-value is the minimum FDR that can be attained when a particular test is
called significant. We give a derivation of the q-value and, following Storey (2002),
first define a set of nested rejection regions $\{\Gamma_\alpha\}_{\alpha=0}^{1}$ where $\alpha$ is such that
$\Pr(T \in \Gamma_\alpha \mid H = 0) = \alpha$. Then

$$\text{p-value}(t) = \inf_{\Gamma_\alpha :\, t \in \Gamma_\alpha} \Pr(T \in \Gamma_\alpha \mid H = 0)$$

is the p-value corresponding to an observed statistic $t$. The q-value is defined as

$$\text{q-value}(t) = \inf_{\Gamma_\alpha :\, t \in \Gamma_\alpha} \Pr(H = 0 \mid T \in \Gamma_\alpha). \tag{4.16}$$

Therefore, for each observed statistic $t_i$, there is an associated q-value. It can be
shown that (Exercise 4.3)

$$\Pr(H_0 \mid T > t_{\text{obs}}) < \Pr(H_0 \mid T = t_{\text{obs}}) \tag{4.17}$$

so that the evidence for $H_0$ given the exact ordinate is always greater than that
corresponding to the tail area.

When  one  decides  upon  a  value  of  FDR  (or  pFDR)  to  use  in  practice,  the  sample 
size  should  again  be  taken  into  account,  since  for  large  sample  size,  one  would  not 
want  to  tolerate  as  large  an  FDR  as  with  a  small  sample  size.  Again,  we  would  prefer 
a  procedure  that  was  consistent.  However,  as  in  the  single  test  situation,  there  is  no 
prescription  for  deciding  how  the  FDR  should  decrease  with  increasing  sample  size. 


Example:  Microarray  Data 

Returning  to  the  microarray  example,  application  of  the  Bonferroni  correction  to 
control  the  FWER  at  0.05  produces  a  list  of  220  significant  transcripts.  In  this 
context, it is likely that there is a large proportion of non-null transcripts (Storey
et al. 2007) and there are relatively large sample sizes for each test (so the power
is good), and so this choice is likely to be very conservative. The procedure of
Benjamini and Hochberg with FDR control at 0.05 gives 480 significant transcripts.
Applying the method of Storey gives an estimate of the proportion of nulls as
$\hat{\pi}_0 = 0.33$. At a pFDR threshold of 0.05, 603 transcripts are highlighted.


4.6.2  Bayesian  Analysis 


In  some  situations,  a  Bayesian  analysis  of  m  tests  may  proceed  in  exactly  the  same 
fashion  as  with  a  single  test,  that  is,  one  can  apply  the  same  procedure  m  times; 
see  Wakefield  (2007a)  for  an  example.  In  this  case  the  priors  on  each  of  the  m  null 
hypotheses  will  be  independent.  In  other  situations,  however,  one  may  often  wish 
to  jointly  model  the  data  so  that  the  totality  of  information  can  be  used  to  estimate 
parameters  that  are  common  to  all  tests. 

In terms of reporting, as with a single test (as considered in Sect. 4.3), the Bayes
factors

$$\text{Bayes Factor}_i = \frac{p(y_i \mid H_i = 0)}{p(y_i \mid H_i = 1)}, \tag{4.18}$$

$i = 1, \dots, m$, are a starting point. These Bayes factors may then be combined with
prior probabilities $\pi_{0i} = \Pr(H_i = 0)$, to give

$$\text{Posterior Odds}_i = \text{Bayes Factor}_i \times \text{Prior Odds}_i, \tag{4.19}$$

where $\text{Prior Odds}_i = \pi_{0i}/(1 - \pi_{0i})$.

We now proceed to a decision theory approach. Suppose, for simplicity, common
losses, $L_{\text{I}}$ and $L_{\text{II}}$, associated with type I and type II errors, for each test. The aim is
to define a rule for deciding which of the $m$ null hypotheses to reject. The operating
characteristics, in terms of “false discovery” and “non-discovery,” corresponding
to this rule may then be determined. The expected loss associated with a particular set of
decisions $\delta = [\delta_1, \dots, \delta_m]$ (with $\delta_i = 1$ if the $i$th null is rejected and $\delta_i = 0$ otherwise)
and hypotheses $H = [H_1, \dots, H_m]$, with the expectation taken over the posterior, is

$$E[L(\delta, H)] = L_{\text{I}} \sum_{i=1}^{m} \delta_i \Pr(H_i = 0 \mid y_i) + L_{\text{II}} \sum_{i=1}^{m} (1 - \delta_i) \Pr(H_i = 1 \mid y_i)
= L_{\text{I}} \left[ \text{EFD} + \frac{L_{\text{II}}}{L_{\text{I}}} \times \text{EFN} \right],$$

where EFD is the expected number of false discoveries (false positives) and EFN is the
expected number of false non-discoveries (false negatives). These characteristics of the
procedure are given, respectively, by

$$\text{EFD} = \sum_{i=1}^{m} \delta_i \Pr(H_i = 0 \mid y_i), \qquad
\text{EFN} = \sum_{i=1}^{m} (1 - \delta_i) \Pr(H_i = 1 \mid y_i),$$

where $\Pr(H_i = 0 \mid y_i)$ and $\Pr(H_i = 1 \mid y_i)$ are the posterior probabilities on the
null and alternative. We should report test $i$ as significant if

$$\Pr(H_i = 1 \mid y_i) > \frac{1}{1 + L_{\text{II}}/L_{\text{I}}},$$

which is identical to the expression derived for a single test, (4.4).

Define $K = \sum_{i=1}^{m} \delta_i$ as the number of rejected tests. Then dividing EFD by
$K$ gives an estimate, based on the posterior, of the proportion of false discoveries,
and dividing EFN by $m - K$ gives a posterior estimate of the proportion of false
non-discoveries. Hence, for a given ratio of losses, we can determine the expected
number of false discoveries and false non-discoveries, and the FDR and FNR. As
$n_i$, the sample size associated with test $i$, increases, under correct specification of
the model, the power for each test increases, and so EFD$/K$ and EFN$/(m - K)$ will
tend to zero. This is in contrast to the frequentist
approach in which a fixed (independent of sample size) FDR rule is used so that the
false non-discovery rate does not decrease to zero (even when the model is true).
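
The decision rule and its operating characteristics can be sketched in a few lines.
The sketch below assumes that the posterior probabilities on the alternative have
already been computed; the function name `bayes_decisions` and the argument
`loss_ratio` (the ratio of the type II loss to the type I loss) are hypothetical names
introduced for illustration.

```python
def bayes_decisions(post_alt, loss_ratio=1.0):
    """Apply the posterior threshold rule and summarize its characteristics.

    post_alt[i] is Pr(H_i = 1 | y_i); loss_ratio is L_II / L_I.
    """
    threshold = 1.0 / (1.0 + loss_ratio)
    # delta_i = 1 means that the i-th null hypothesis is rejected.
    delta = [1 if p > threshold else 0 for p in post_alt]
    # Expected false discoveries: posterior null probability of rejected tests.
    efd = sum(d * (1.0 - p) for d, p in zip(delta, post_alt))
    # Expected false non-discoveries: posterior alternative probability
    # of the tests that were not rejected.
    efn = sum((1 - d) * p for d, p in zip(delta, post_alt))
    k = sum(delta)
    m = len(post_alt)
    return {
        "delta": delta,
        "EFD": efd,
        "EFN": efn,
        "K": k,
        "FDR": efd / k if k > 0 else 0.0,
        "FNR": efn / (m - k) if k < m else 0.0,
    }
```

Increasing `loss_ratio` lowers the threshold, so more tests are rejected; the
estimated proportion of false discoveries then tends to rise while the estimated
proportion of false non-discoveries tends to fall.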

Notice that the use of Bayes factors does not depend on the number of tests, $m$, so
that, for example, we could analyze the data in the same way regardless of whether
$m$ is 1 or 1,000,000. Similarly, for the assumed independent priors, the posterior
probabilities do not depend on $m$, and for the loss structure considered, the decision
does not depend on $m$. Hence, the Bayes procedure gives thresholds that depend
on $n$ (since the Bayes factor will depend on sample size, see Exercise 4.1 for an
example) but not on $m$, while the contrary is true for many frequentist procedures
such as Bonferroni.

There is a prior that results in a Bayesian Bonferroni-type correction. If the prior
probabilities of each of the nulls are independent with $\pi_{0i} = \pi_0$ for $i = 1, \dots, m$,
then the prior probability that all nulls are true is

$$\Pi_0 = \Pr(H_1 = 0, \dots, H_m = 0) = \pi_0^m,$$

which we refer to as prior $P_1$. For example, if $\pi_0 = 0.5$ and $m = 10$, $\Pi_0 = 0.00098$,
which may not reflect the required prior belief. Suppose instead that we wish to fix
the prior probability that all of the nulls are true at $\Pi_0$. A simple way of achieving
this is to take $\pi_{0i} = \Pi_0^{1/m}$, a prior specification we call $P_2$. Westfall et al. (1995)
show that for independent tests

$$\alpha_B = \Pr(H_i = 0 \mid y_i, P_2) \approx m \times \Pr(H_i = 0 \mid y_i, P_1) = m \times \alpha^{\star},$$

so that a Bayesian version of Bonferroni is recovered.
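
The arithmetic behind the two prior specifications is easily checked. In the
sketch below the function names are hypothetical, and the value 0.5 used for the
joint prior probability in the second call is chosen only for illustration.

```python
def prior_all_nulls(pi0, m):
    """Prior probability that all m nulls hold when each holds independently
    with probability pi0."""
    return pi0 ** m


def per_test_prior(joint_prior, m):
    """Per-test null probability that fixes the probability that all m
    independent nulls hold at joint_prior."""
    return joint_prior ** (1.0 / m)


# Common per-test prior of 0.5 with m = 10 tests.
print(round(prior_all_nulls(0.5, 10), 5))
# Per-test prior needed so that all ten nulls hold with probability 0.5.
print(round(per_test_prior(0.5, 10), 3))
```

Fixing the joint probability therefore pushes each per-test null probability
towards 1 as the number of tests grows, which is the source of the
Bonferroni-type behavior.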

An alternative approach is to specify a full model for the totality of data.
These data can then be exploited to estimate common parameters. In particular,
the proportion of null tests $\pi_0$ can be estimated, which is crucial for inference since
posterior odds and decisions are (unsurprisingly) highly sensitive to the value of
$\pi_0$. The decision is still based on the posterior, and there continues to be a trade-off
between false positives and false negatives depending on the decision threshold used.
We illustrate using the microarray data.


Example:  Microarray  Data 

Recall that we assume $Y_i \mid \mu_i \sim_{ind} \text{N}(\mu_i, \sigma_i^2)$, $i = 1, \dots, m$, where $m = 1{,}000$.
We first describe a Bayesian analysis in which the $m$ transcripts are analyzed
separately. We assume under the null that $\mu_i = 0$, while under the alternative
$\mu_i \sim_{iid} \text{N}(0, \tau^2)$ with $\tau^2$ fixed. For illustration, we assume that for non-null genes,
a fold change in the mean greater than 10%, that is, $\mu_i > \log_2(1.1) = 0.138$, only occurs
with probability 0.025. Given

$$\Pr\left( -\infty < \frac{\mu_i}{\tau} < \frac{\log_2(1.1)}{\tau} \right) = 0.975,$$

we can solve for $\tau$ to give

$$\tau = \frac{\log_2(1.1)}{\Phi^{-1}(0.975)} = 0.070,$$

where $\Phi(\cdot)$ is the distribution function of a standard normal random variable.

Fig. 4.7 Ordered $-\log_{10}$(Bayes factors) for the microarray data. The dashed line at 0 is for reference

The prior on $\mu_i$, with $\tau = 0.070$, is therefore

$$\mu_i \begin{cases} = 0 & \text{with probability } \pi_0 \\ \sim \text{N}(0, \tau^2) & \text{with probability } \pi_1 = 1 - \pi_0. \end{cases}$$
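
The calibration of the prior standard deviation can be reproduced numerically.
The following sketch uses only the Python standard library; the function name
`prior_sd` and its arguments are hypothetical and are introduced purely to show
the calculation.

```python
from math import log2
from statistics import NormalDist


def prior_sd(fold_change=1.1, tail_prob=0.025):
    """Standard deviation of a zero-mean normal prior on the log2 scale such
    that effects above log2(fold_change) have prior probability tail_prob."""
    return log2(fold_change) / NormalDist().inv_cdf(1.0 - tail_prob)


tau = prior_sd()
# Check: under N(0, tau^2) the probability of exceeding log2(1.1) is 0.025.
upper_tail = 1.0 - NormalDist(0.0, tau).cdf(log2(1.1))
print(round(tau, 3), round(upper_tail, 3))
```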


The Bayes factor for the $i$th transcript is

$$\text{Bayes Factor}_i = \sqrt{\frac{\sigma_i^2 + \tau^2}{\sigma_i^2}} \exp\left( -\frac{Z_i^2}{2} \, \frac{\tau^2}{\sigma_i^2 + \tau^2} \right), \tag{4.20}$$

where $Z_i = Y_i/\sigma_i$ is the Z score for the $i$th transcript. Therefore, we see that
the Bayes factor depends on the power through $\sigma_i^2$ (which itself depends on the
sample sizes), as well as on the Z-score, while the p-value depends on the latter
only. In Fig. 4.7, we plot the ordered $-\log_{10}$(Bayes factors) (so that high values
correspond to evidence against the null). A reference line of 0 is indicated and, using
this reference, for 487 transcripts the data are more likely under the alternative than
under the null.
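
For this normal model with a normal prior, the Bayes factor is the ratio of the
$\text{N}(0, \sigma_i^2)$ density (null) to the $\text{N}(0, \sigma_i^2 + \tau^2)$ density (alternative), each
evaluated at the observed $y_i$. The sketch below evaluates the resulting closed form
and, as a check, the direct density ratio. The function names are hypothetical, and
the numerical inputs in the usage lines are invented for illustration only.

```python
from math import exp, sqrt
from statistics import NormalDist


def bayes_factor(y, sigma2, tau2):
    """Closed form of p(y | H = 0) / p(y | H = 1) for the normal-normal model."""
    z2 = y * y / sigma2              # squared Z score
    shrink = tau2 / (sigma2 + tau2)  # weight on the prior variance
    return sqrt((sigma2 + tau2) / sigma2) * exp(-0.5 * z2 * shrink)


def bayes_factor_from_densities(y, sigma2, tau2):
    """Same quantity, computed directly as a ratio of normal densities."""
    null = NormalDist(0.0, sqrt(sigma2)).pdf(y)
    alt = NormalDist(0.0, sqrt(sigma2 + tau2)).pdf(y)
    return null / alt


# Two transcripts with the same Z score (2.0) but different precision:
# the p-values agree, but the Bayes factors do not.
print(bayes_factor(0.2, 0.01, 0.005), bayes_factor(0.02, 0.0001, 0.005))
```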

To obtain the posterior odds, we need to specify a prior for the null. We assume
$\pi_0 = \Pr(H_i = 0)$ so that the prior is the same for all transcripts. The posterior odds
are the product of the Bayes factor and the prior odds and are highly sensitive to
the choice of $\pi_0$. For illustration, suppose the decision rule is to call a transcript
significant if the posterior odds of $H = 0$ are less than 1 (which corresponds
to a ratio of losses, $L_{\text{II}}/L_{\text{I}} = 1$). Figure 4.8 plots the number of such significant
transcripts under this rule, as a function of the prior, $\pi_0$. The sensitivity to the
choice of $\pi_0$ is evident. To overcome this problem, we now describe a joint model
for the data on all $m = 1{,}000$ transcripts that allows estimation of parameters that
are common across transcripts, including $\pi_0$. Notice that for virtually the complete
range of $\pi_0$ more transcripts would be called as significant under the Bayes rule than
under the FWER.

Fig. 4.8 Number of significant transcripts in the microarray data, as measured

We specify a mixture model for the collection $[\mu_1, \dots, \mu_m]$, with

$$\mu_i \begin{cases} = 0 & \text{with probability } \pi_0 \\ \sim \text{N}(\delta, \tau^2) & \text{with probability } \pi_1 = 1 - \pi_0. \end{cases}$$

We use mixture component indicators $H_i = 0/1$ to denote the zero/normal
membership model for transcript $i$. Collapsing over $\mu_i$ gives the three-stage model:
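Stage One:

$$Y_i \mid H_i, \delta, \tau, \pi_0 \sim_{ind} \begin{cases} \text{N}(0, \sigma_i^2) & \text{if } H_i = 0 \\ \text{N}(\delta, \sigma_i^2 + \tau^2) & \text{if } H_i = 1. \end{cases}$$

Stage Two: $H_i \mid \pi_1 \sim_{iid} \text{Bernoulli}(\pi_1)$, $i = 1, \dots, m$.

Stage Three: Independent priors on the common parameters:

$$p(\delta, \tau, \pi_0) = p(\delta)\, p(\tau)\, p(\pi_0).$$

We illustrate the use of this model with

$$p(\delta) \propto 1, \qquad p(\tau) \propto 1/\tau, \qquad p(\pi_0) = 1,$$
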

so that we have improper priors for $\delta$ and $\tau^2$. The latter choice still produces a
proper posterior because we have fixed variances at the first stage of the model
(see Sect. 8.6.2 for further discussion). Implementation is via a Markov chain Monte
Carlo algorithm (see Sect. 3.8). Exercise 4.4 derives details of the algorithm.

Fig. 4.9 Posterior distributions for selected parameters of the mixture model, for the microarray
data: (a) $p(\delta \mid y)$, (b) $p(\tau^2 \mid y)$, (c) $p(\pi_0 \mid y)$, (d) $p(\delta, \tau^2 \mid y)$, (e) $p(\delta, \pi_0 \mid y)$, (f) $p(\tau^2, \pi_0 \mid y)$

The posterior median and 95% interval for $\delta$ ($\times 10^{-3}$) is $-6.8$ $[-9.4, -0.40]$,
while for $\tau^2$ ($\times 10^{-3}$), we have 1.1 [0.92, 1.2]. Of more interest are the posterior
summaries for $\pi_0$: 0.29 [0.24, 0.33], giving a range that is consistent with the pFDR
estimate of 0.33. Figure 4.9 displays univariate and bivariate posterior distributions.
The distributions resemble normal distributions, reflecting the large samples within
populations and the number of transcripts.

Fig. 4.10 Posterior probabilities $\Pr(H_i = 1 \mid y)$, from the mixture model for the microarray
data, for each of the $i = 1, \dots, 1{,}000$ transcripts, ordered in terms of increasing posterior
probability on the alternative


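For transcript $i$, we may evaluate the posterior probabilities of the alternative

$$\begin{aligned}
\Pr(H_i = 1 \mid y) &= E[H_i \mid y] \\
&= E_{\delta, \tau^2, \pi_0 \mid y}\left[ E(H_i \mid y, \delta, \tau^2, \pi_0) \right] \\
&= E_{\delta, \tau^2, \pi_0 \mid y}\left[ \Pr(H_i = 1 \mid y, \delta, \tau^2, \pi_0) \right] \\
&= E_{\delta, \tau^2, \pi_0 \mid y}\left[ \frac{p(y_i \mid H_i = 1, \delta, \tau^2) \times \pi_1}{p(y_i \mid H_i = 1, \delta, \tau^2) \times \pi_1 + p(y_i \mid H_i = 0) \times \pi_0} \right],
\end{aligned} \tag{4.21}$$

where

$$p(y_i \mid H_i = 1, \delta, \tau^2, \pi_0) = \left[ 2\pi(\sigma_i^2 + \tau^2) \right]^{-1/2} \exp\left[ -\frac{(y_i - \delta)^2}{2(\sigma_i^2 + \tau^2)} \right],$$

$$p(y_i \mid H_i = 0, \delta, \tau^2, \pi_0) = \left[ 2\pi\sigma_i^2 \right]^{-1/2} \exp\left[ -\frac{y_i^2}{2\sigma_i^2} \right].$$

Expression (4.21) averages $\Pr(H_i = 1 \mid y, \delta, \tau^2, \pi_0)$ with respect to the posterior
$p(\delta, \tau^2, \pi_0 \mid y)$ and may be simply evaluated via

$$\frac{1}{T} \sum_{t=1}^{T} \frac{p(y_i \mid H_i = 1, \delta^{(t)}, \tau^{2(t)})\, \pi_1^{(t)}}{p(y_i \mid H_i = 1, \delta^{(t)}, \tau^{2(t)})\, \pi_1^{(t)} + p(y_i \mid H_i = 0)\, \pi_0^{(t)}},$$

given samples $\delta^{(t)}, \tau^{2(t)}, \pi_0^{(t)}$, $t = 1, \dots, T$, from the Markov chain.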
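
The Monte Carlo average is straightforward to code once posterior samples are
available. The sketch below assumes that the samples are supplied as triples of the
mean, variance, and null proportion of the mixture; the function names and the
layout of `draws` are hypothetical, and the sampler that produces the draws is not
shown.

```python
from math import exp, pi, sqrt


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-0.5 * (y - mean) ** 2 / var) / sqrt(2.0 * pi * var)


def posterior_prob_alt(y_i, sigma2_i, draws):
    """Average Pr(H_i = 1 | y, delta, tau2, pi0) over posterior draws.

    draws is an iterable of (delta, tau2, pi0) samples from the Markov chain.
    """
    total = 0.0
    count = 0
    for delta, tau2, pi0 in draws:
        # Marginal density under the alternative, weighted by pi_1 = 1 - pi0.
        alt = normal_pdf(y_i, delta, sigma2_i + tau2) * (1.0 - pi0)
        # Density under the null, weighted by pi0.
        null = normal_pdf(y_i, 0.0, sigma2_i) * pi0
        total += alt / (alt + null)
        count += 1
    return total / count
```

Because each term conditions on one draw of the common parameters, the
average propagates posterior uncertainty in those parameters into the reported
probability for each transcript.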

Figure 4.10 displays the ordered posterior probabilities, $\Pr(H_i = 1 \mid y)$, $i =
1, \dots, m$, along with a reference line of 0.5. Using this line as a threshold, 689
transcripts are flagged as “significant,” and the posterior estimate of the proportion
of false discoveries is 0.12. Interestingly, the posterior estimate of the proportion
of false negatives (i.e., non-discoveries) is 0.35. The latter figure is rarely reported
but is a useful summary. Previously, using a pFDR threshold of 0.05, there were
603 significant transcripts. Using a rule that picked the 603 transcripts
whose posterior probability on the alternative was highest yielded a posterior
estimate of the proportion of false discoveries of 0.07, which is not very
different from the pFDR estimate. This is reassuring for both the Bayesian and the
pFDR approaches.

For this example, sensitivity analyses might relax the independence between
transcripts and, more importantly, the normality assumption for the random
effects $\mu_i$.

The  Bayes  factor,  (4.20),  was  derived  under  the  assumption  of  a  normal  sampling 
likelihood.  In  general,  if  we  have  large  sample  sizes,  we  may  take  as  likelihood 
the  sampling  distribution  of  an  estimator  and  combine  this  with  a  normal  prior,  to 
give  a  closed-form  estimator.  The  latter  is  an  approximation  to  a  Bayesian  analysis 
with  weakly  informative  priors  on  the  nuisance  parameters  and  was  described  in 
Sect.  3.11,  with  Bayes  factor  (3.45). 


4.7  Testing  Multiple  Hypotheses:  Variable  Selection 

A  ubiquitous  issue  in  regression  modeling  is  deciding  upon  which  covariates  to 
include  in  the  model.  It  is  useful  to  distinguish  three  scenarios: 

1.  Confirmatory:  In  which  a  summary  of  the  strength  of  association  between  a 
response  and  covariates  is  required.  We  include  in  this  category  the  situation 
in  which  an  a  priori  hypothesis  concerning  a  particular  response/covariate 
relationship  is  of  interest;  additional  variables  have  been  measured  and  we  wish, 
for  example,  to  know  which  to  adjust  for  in  order  to  reduce  confounding. 

2.  Exploration:  In  which  the  aim  is  to  gain  clues  about  structure  in  the  data. 
A  particular  example  is  when  one  wishes  to  gain  leads  as  to  which  covariates 
are  associated  with  a  response,  perhaps  to  guide  future  study  design. 

3. Prediction: In which we are not explicitly concerned with association but merely
with predicting a response based on a set of covariates. In this case, we are
not interested in the numerical values of parameters but rather in the ability to
predict new outcomes. Chapters 10-12 examine prediction in detail, including
the assessment of predictive accuracy.

For  exploration,  formal  inference  is  not  required  and  so  we  will  concentrate  on  the 
confirmatory  scenario.  As  we  will  expand  upon  in  Sect.  5.9,  a  trade-off  must  be 
made  when  deciding  on  variables  for  inclusion  and  it  is  often  not  desirable  to  fit 
the  full  model.  To  summarize  the  discussion,  as  we  include  more  covariates  in  the 
model,  bias  in  estimates  is  reduced,  but  variability  may  be  increased,  depending  on 
how strong a predictor the covariate is and on its association with other covariates.


Example: Prostate Cancer

To illustrate a number of the methods available for variable selection, we consider
a dataset originally presented by Stamey et al. (1989) and introduced in Sect. 1.3.1.
The data were collected on $n = 97$ men before radical prostatectomy. We take
as response the log of prostate-specific antigen (PSA), which was being put forward
in the paper as a preoperative marker, that is, a predictor of the clinical stage of
cancer. The authors examined log PSA as a function of eight covariates: log(can
vol); log(weight) (where weight is prostate weight); age; log(BPH); SVI; log(cap
pen); the Gleason score, referred to as gleason; and percentage Gleason score 4 or
5, referred to as PGS45.

Figure 1.1 shows the relationships between the response and each of the
covariates and indicates what look like a number of strong associations, while
Fig. 1.2 gives some idea of the dependencies among the more strongly associated
covariates. We will return to this example after describing a number
of methods for selecting variables in Sect. 4.8 and discussing model uncertainty in
Sect. 4.9.


4.8  Approaches  to  Variable  Selection  and  Modeling 

We now review a number of approaches to variable selection. Let $k$ be the number
of covariates, and for ease of exposition, assume each covariate is either binary or
continuous, so that the association is summarized by a univariate parameter. We also
exclude interactions so that the largest model contains $k + 1$ regression coefficients.
Allowing for the inclusion/exclusion of each covariate only, there are $2^k$ possible
models, a number which increases rapidly with $k$. For example, with $k = 20$ there
are 1,048,576 possible models. The number of models increases even more rapidly
with the number of covariates, if we allow variables with more than two levels and/or
interactions.

The hierarchy principle states that if an interaction term is included in the model,
then the constituent main effects should be included also. If we do not apply the
hierarchy principle, there are $2^{2^k - 1}$ interaction models (i.e., models that include
main effects and/or interactions), where $k$ is the number of variables. For example,
$k = 2$ leads to 8 models. Denoting the variables by $A$ and $B$, these models are

1, A, B, A + B, A + B + A.B, A + A.B, B + A.B, A.B.

The class of hierarchical models includes all models that obey the hierarchy
principle. Applying the hierarchy principle in the $k = 2$ case reduces the number
from 8 to 5, as we lose the last three models in the above list. With $k = 5$
variables, there are 2,147,483,648 interaction models, illustrating the sharp increase
in the number of models with $k$. There is no general rule for counting the number
of models that satisfy the hierarchy principle for a given dimension. For some
discussion, see Darroch et al. (1980, Sect. 6). The latter include a list of the number
of hierarchical models for $k = 1, \dots, 5$; for $k = 5$, the number of hierarchical
models is 7,580.
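
For small $k$ these counts can be confirmed by brute-force enumeration. The
sketch below, whose function name is hypothetical, treats a model as a set of terms
(each term being a non-empty subset of the variables, with the intercept always
present) and checks the hierarchy principle directly. Enumeration is only practical
for small $k$, since the number of candidate models grows as $2^{2^k - 1}$.

```python
from itertools import combinations


def count_models(k):
    """Return (number of interaction models, number that are hierarchical)."""
    # Every non-empty subset of the k variables is a possible model term:
    # singletons are main effects, larger subsets are interactions.
    terms = [frozenset(c) for r in range(1, k + 1)
             for c in combinations(range(k), r)]
    n_terms = len(terms)  # equals 2**k - 1
    total = 2 ** n_terms
    hierarchical = 0
    for mask in range(total):
        model = {terms[j] for j in range(n_terms) if mask >> j & 1}
        # Hierarchy principle: each included term requires all of its
        # lower-order constituent terms to be included as well.
        obeys = all(
            frozenset(sub) in model
            for term in model
            for r in range(1, len(term))
            for sub in combinations(term, r)
        )
        hierarchical += obeys
    return total, hierarchical


for k in (1, 2, 3):
    print(k, count_models(k))
```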

We  begin  by  illustrating  the  problems  of  variable  selection  with  a  simple 
example. 


Example:  Confounder  Adjustment 

Suppose the true model is

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, \tag{4.22}$$

with $\epsilon_i \mid \sigma^2 \sim_{iid} \text{N}(0, \sigma^2)$, $i = 1, \dots, n$. We take $x_1$ as the covariate of interest,
so that estimation of $\beta_1$ is the focus. However, we decide to “control” for the
possibility of $\beta_2 \neq 0$ via a test. For simplicity, we assume that $\sigma^2$ is known and
assess significance by examining whether a 95% confidence interval for $\beta_2$ contains
zero (which is equivalent to a two-sided hypothesis test with $\alpha = 0.05$). If the
interval contains zero, then the model

$$E[Y_i \mid x_{1i}, x_{2i}] = \beta_0^{\star} + \beta_1^{\star} x_{1i}$$

is fitted; otherwise, we fit (4.22). We illustrate the effects of this procedure through
a simulation in which we take $\beta_0 = \beta_1 = \beta_2 = 1$, $\sigma^2 = 3^2$, and $n = 10$.
The covariates $x_1$ and $x_2$ are simulated from a bivariate normal with means zero,
variances one and correlation 0.7.
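
A simulation of this kind can be sketched as follows, using only the Python
standard library. Several details are assumptions of the sketch rather than a
description of the simulation reported here: the covariates are redrawn in every
replicate, and the number of replicates, the seed, and the function name `simulate`
are arbitrary choices. The summaries it produces therefore need not match the
numerical values quoted in this example.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate(reps=5000, n=10, sigma=3.0, rho=0.7, seed=1):
    """Return (full-model estimates, reported estimates) of beta_1."""
    rng = random.Random(seed)
    full, reported = [], []
    for _ in range(reps):
        # Covariates: bivariate normal, zero means, unit variances, corr rho.
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
        # True model with beta_0 = beta_1 = beta_2 = 1.
        y = [1 + a + b + rng.gauss(0, sigma) for a, b in zip(x1, x2)]
        m1, m2, my = mean(x1), mean(x2), mean(y)
        c1 = [a - m1 for a in x1]
        c2 = [b - m2 for b in x2]
        cy = [v - my for v in y]
        s11 = sum(a * a for a in c1)
        s22 = sum(b * b for b in c2)
        s12 = sum(a * b for a, b in zip(c1, c2))
        s1y = sum(a * v for a, v in zip(c1, cy))
        s2y = sum(b * v for b, v in zip(c2, cy))
        det = s11 * s22 - s12 ** 2
        # Least squares slopes in the full model (centered normal equations).
        b1_full = (s22 * s1y - s12 * s2y) / det
        b2_full = (s11 * s2y - s12 * s1y) / det
        se_b2 = sigma * sqrt(s11 / det)  # sigma is treated as known
        # Slope in the reduced model that omits x2.
        b1_reduced = s1y / s11
        # Keep x2 only if the 95% interval for beta_2 excludes zero.
        keep_x2 = abs(b2_full) > 1.96 * se_b2
        full.append(b1_full)
        reported.append(b1_full if keep_x2 else b1_reduced)
    return full, reported


full, reported = simulate()
print(mean(full), stdev(full))
print(mean(reported), stdev(reported))
```

The first line of output summarizes the estimator from the full model and the
second the estimator that is reported after the test, so that the two can be
compared directly.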

In Fig. 4.11a, we display the sampling distribution of $\hat{\beta}_1$ given the fitting of
model (4.22). The mean and standard deviation of the distribution of $\hat{\beta}_1$ are 1.00
and 1.23, respectively. Unbiasedness follows directly from least squares/likelihood
theory (Sect. 5.6).

Figure 4.11b displays the sampling distribution of the reported estimator when
we allow for the possibility of adjustment according to a test of $\beta_2 \neq 0$. The mean
and standard deviation of the distribution of the reported estimator of $\beta_1$ are 1.23 and
1.01, respectively, showing positive bias and a reduced variance. This distribution
is a mixture of the sampling distribution of $\hat{\beta}_1$ (the estimator obtained from the
full model), and the sampling distribution of $\hat{\beta}_1^{\star}$, with the mixing weight on the
latter corresponding to one minus the power of the test of $\beta_2 = 0$. The sampling
distribution of $\hat{\beta}_1^{\star}$ is shifted because the effects of both $x_1$ and $x_2$ are being included
in the estimate, and the distribution is shifted to the right because $x_1$ and $x_2$ are
positively correlated. Using the conditional mean of a bivariate normal (given as
(D.1) in Appendix D) and $\beta_2 = 1$, we have

$$\begin{aligned}
E[Y \mid x_1] &= \beta_0 + \beta_1 x_1 + \beta_2 E[X_2 \mid x_1] \\
&= \beta_0 + (\beta_1 + 0.7) \times x_1 \\
&= \beta_0^{\star} + \beta_1^{\star} x_1,
\end{aligned}$$

illustrating the bias,

$$E[\hat{\beta}_1^{\star}] - \beta_1 = 0.7 \tag{4.23}$$

when the reduced model is fitted. Allowing for the possibility of adjustment gives
an estimator with a less extreme bias, since sometimes the full model is fitted (if
the null is rejected). The reason for the lower reported variance in the potentially
adjusted analysis is the bias-variance trade-off intrinsic to variable selection. In
model (4.22), the information concerning $\beta_1$ and $\beta_2$ is entangled because of the
correlation between $x_1$ and $x_2$, which results in a higher variance. Section 5.9
provides further discussion. The reported variance is not appropriate, however,
since it does not acknowledge the model building process, an issue we examine
in Sect. 4.9. As $n \to \infty$, the power of the test to reject $\beta_2 = 0$ tends to 1, and we
recover an unbiased estimator with an appropriate variance.

□ 


4.8.1  Stepwise  Methods 

A  number  of  methods  have  been  proposed  that  proceed  in  a  stepwise  fashion, 
adding  or  removing  variables  from  a  current  model.  We  describe  three  of  the  most 
historically  popular  approaches. 

Forward selection begins with the null model, $E[Y \mid x] = \beta_0$, and then fits each
of the models

$$E[Y \mid x] = \beta_0 + \beta_j x_j,$$

$j = 1, \dots, k$. Subject to a minimal requirement (i.e., a particular p-value threshold),
the model that contains the covariate that provides the greatest “improvement” in fit
is then carried forward. This procedure is then iterated until no covariates meet the
minimal requirement (i.e., all the p-values are greater than the threshold), or all the
variables are in the model.
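
The control flow of forward selection can be sketched independently of how the
models are fitted. In the sketch below the fitting step is abstracted into a
user-supplied function `p_value(selected, candidate)`, assumed to return the p-value
for adding `candidate` to the model that already contains `selected`; this function,
the name `forward_selection`, and the default threshold are all hypothetical.

```python
def forward_selection(candidates, p_value, p_enter=0.05):
    """Greedy forward selection driven by a p-value-to-enter threshold.

    p_value(selected, candidate) must return the p-value for adding
    `candidate` to a model that already contains the covariates in `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        # Score each covariate not yet in the model.
        scored = [(p_value(selected, c), c) for c in remaining]
        best_p, best = min(scored)
        if best_p > p_enter:
            break  # no remaining covariate meets the minimal requirement
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy p-value function, invented for illustration only: each covariate has a
# baseline p-value that is inflated as more covariates enter the model.
def toy_p_value(selected, candidate):
    baseline = {"x1": 0.001, "x2": 0.03, "x3": 0.4}
    return baseline[candidate] * (1 + len(selected))


print(forward_selection(["x1", "x2", "x3"], toy_p_value))
```

Because only one covariate is added at each step, combinations that are useful
only jointly are never evaluated by this search.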

Backward elimination has the same flavor but begins with the full model, and
then removes, at each stage, the covariate that is contributing least to the fit. For
example, the variable with the largest p-value, so long as it is bigger than some
prespecified value, is removed from the model.

Each of these approaches can miss important models. For example, in forward
selection, $x_1$ may be the “best” single variable, yet $x_1$ paired with any other variable
may be “worse” than $x_2$ and $x_3$ together (say); the latter combination will
never be considered. Related problems can occur with backward elimination. Such
considerations lead to Efroymson’s algorithm (Efroymson 1960) in which forward
selection is followed by backward elimination. The initial steps are identical to
forward selection, but with three or more variables in the model, the loss of fit
of each of the variables (excluding the last one added) is examined, in order to
avoid the scenario just described, since in this case if the order of variables being
added was $x_1, x_2, x_3$, it would then be possible for $x_1$ to be removed. The “p-value
to enter” value (i.e., the threshold for forward selection) is chosen to be smaller
than the “p-value to remove” value (i.e., the threshold for backward elimination),
to prevent cycling in which a variable is continually added and then removed.
The choice of inclusion/exclusion values is contentious for forward selection,
backward elimination and Efroymson’s algorithm.

Fig. 4.11 (a) Sampling distribution of $\hat{\beta}_1$, controlling for $x_2$, and (b) sampling
distribution of $\hat{\beta}_1$, given the possibility of controlling for $x_2$

The Efroymson procedure, although overcoming some of the deficiencies of
forward selection and backward elimination, can still miss important models.
The overall frequentist properties of any subset selection approach are difficult to
determine, as we discuss in Sect. 4.9.

Each  of  the  stepwise  approaches  may  miss  important  models.  A  popular 
alternative  is  to  examine  all  possible  models  and  to  then  select  the  “best”  model. 
We  next  provide  a  short  summary  of  some  of  the  criteria  that  have  been  suggested 
for  this  selection. 


4.8.2  All  Possible  Subsets 

We first consider linear models and again suppose there are $k$ potential regressors,
with the full model of the form

$$y = x\beta + \epsilon \tag{4.24}$$

with $E[\epsilon] = 0$, $\text{var}(\epsilon) = \sigma^2 I_n$, and where $y$ is $n \times 1$, $x$ is $n \times (k + 1)$, and $\beta$ is
$(k + 1) \times 1$.

The $R^2$ measure of variance explained is

$$R^2 = 1 - \frac{\text{RSS}}{\text{CTSS}},$$

where the residual and corrected total sum of squares are given, respectively, by

$$\text{RSS} = (y - x\hat{\beta})^{\text{T}} (y - x\hat{\beta}), \qquad \text{CTSS} = (y - 1\bar{y})^{\text{T}} (y - 1\bar{y}).$$

Consequently, $R^2$ can be interpreted as measuring the closeness of the fit to the data,
with $R^2 = 1$ for a perfect fit (RSS $= 0$) and $R^2 = 0$ if the model does not improve
upon the intercept only model. In terms of a comparison of nested models, the $R^2$
measure is nondecreasing in the number of variables, and so picking the model with
the largest $R^2$ will always produce the full model.

Let $P$ represent a model constructed from covariates whose indices are a subset
of $\{1, 2, \dots, k\}$, with $p = |P| + 1$ regression coefficients in this model. The number
of parameters $p$ accounts for the inclusion of an intercept so that in the full model
$p = k + 1$. Suppose the fit of model $P$ yields estimator $\hat{\beta}_P$ and residual sum
of squares $\text{RSS}_P$. For model comparison, a more useful measure than $R^2$ is the
adjusted $R^2$ which is defined as

$$R_a^2 = 1 - \frac{\text{RSS}_P/(n - p)}{\text{CTSS}/(n - 1)} = 1 - \frac{n - 1}{n - p}(1 - R^2).$$

Maximization of $R_a^2$ leads to the model that produces the smallest estimate of $\sigma^2$
across models.

A widely used statistic, known as Mallows $C_P$, was introduced by Mallows
(1973).¹ For the model associated with the subset $P$,

$$C_P = \frac{\text{RSS}_P}{\hat{\sigma}^2} - (n - 2p), \tag{4.25}$$

where $\text{RSS}_P = (y - x\hat{\beta}_P)^{\text{T}} (y - x\hat{\beta}_P)$ is the residual sum of squares and $\hat{\sigma}^2 =
\text{RSS}_k/(n - k - 1)$ is the error variance estimate from the full model that contains all $k$
covariates. This criterion may be derived via consideration of the prediction error that
results from choosing the model under consideration (as we show in Sect. 10.6.1).
It is usual to plot $C_P$ versus $p$ and for a good model $C_P$ will be close to, or below,
$p$, since $E[\text{RSS}_P] = (n - p)\sigma^2$ and so $E[C_P] = p$ for a good model.

Lindley (1968) showed that Mallows $C_P$ can also be derived from a Bayesian
decision approach to multiple regression in which, among other assumptions, the
aim is prediction and the $X$’s are random and multivariate normal.

We now turn to more general models than (4.24). Consideration of the likelihood
alone is not useful since the likelihood increases as parameters are added to the
model, as we saw with the residual sum of squares in linear models. Many
penalized likelihood statistics have been proposed that penalize models for their
complexity; we concentrate on just two, AIC and BIC. An Information Criterion
(AIC, Akaike 1973) is a generalization of Mallows $C_P$ and is defined as

$$\text{AIC} = -2\, l(\hat{\beta}_P) + 2p, \tag{4.26}$$

where $l(\hat{\beta}_P)$ denotes the maximized log-likelihood of, and $p$ the number of
parameters in, model $P$. A derivation of AIC is presented in Sect. 10.6.5. We have
already encountered the Bayesian information criterion (BIC) in Sect. 3.10 as an
approximation to a Bayes factor. The BIC is given by

$$\text{BIC} = -2\, l(\hat{\beta}_P) + p \log n.$$

For the purposes of model selection, one approach is to choose between models by
selecting the one with the minimum AIC or BIC. In general, BIC penalizes larger
models more heavily than AIC, so that in practice AIC tends to pick models that are
more complicated. As an indication, for a single parameter ($p = 1$ in (4.26)), the
significance level is $\alpha = 0.157$, corresponding to $\Pr(\chi_1^2 > 2)$, which is a very liberal
threshold. Given regularity conditions, BIC is consistent (Haughton 1988, 1989;
Rao and Wu 1989), meaning that if the correct model is in the set being considered,
it will be picked with a probability that approaches 1 with increasing sample size,
while AIC is not. The appearance of $n$ in the penalty term of BIC is not surprising,
since this is required for consistency.

¹ Named in honor of Cuthbert Daniel with whom Mallows initially discussed the use of the $C_P$
statistic.
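
The criteria of this section are simple functions of the residual sum of squares,
and the following sketch collects them for a linear model. It assumes a normal
likelihood with the error variance replaced by its maximum likelihood estimate
when forming AIC and BIC, which is our assumption for illustration; the function
and argument names are hypothetical, and the numbers in the usage lines are
invented.

```python
from math import log, pi, sqrt
from statistics import NormalDist


def selection_criteria(rss_p, p, n, ctss, sigma2_full):
    """R^2, adjusted R^2, Mallows C_P, AIC and BIC for a linear submodel.

    rss_p: residual sum of squares of the submodel with p coefficients;
    ctss: corrected total sum of squares;
    sigma2_full: error variance estimate from the full model.
    """
    r2 = 1.0 - rss_p / ctss
    adj_r2 = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
    cp = rss_p / sigma2_full - (n - 2 * p)
    # Minus twice the normal log-likelihood, maximized over the error variance
    # (an assumption of this sketch; the variance is not counted in p).
    neg2loglik = n * log(rss_p / n) + n * (1.0 + log(2.0 * pi))
    return {
        "R2": r2,
        "adjR2": adj_r2,
        "Cp": cp,
        "AIC": neg2loglik + 2 * p,
        "BIC": neg2loglik + p * log(n),
    }


def aic_significance_level():
    """Pr(chi-squared on 1 df > 2), computed as Pr(|Z| > sqrt(2))."""
    return 2.0 * (1.0 - NormalDist().cdf(sqrt(2.0)))


# Invented numbers: n = 20 observations, full model with k = 3 covariates.
full = selection_criteria(rss_p=32.0, p=4, n=20, ctss=100.0, sigma2_full=32.0 / 16)
print(full["Cp"], round(aic_significance_level(), 3))
```

For the full model, $C_P$ equals $p$ by construction, which provides a convenient
check on an implementation.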


4.8.3  Bayesian  Model  Averaging 

Rather  than  select  a  single  model,  Bayesian  model  averaging  (BMA)  places  priors 
over  the  candidate  models,  and  then  inference  for  a  function  of  interest  is  carried 
out  by  averaging  over  the  posterior  model  probabilities.  Section  3.6  described 
this  approach  in  detail,  and  we  will  shortly  demonstrate  its  use  with  the  prostate 
cancer  data. 


4.8.4  Shrinkage  Methods 

An  alternative  approach  to  selecting  a  model  is  to  consider  the  full  model  but 
to  allow  shrinkage  of  the  least  squares  estimates.  Ridge  regression  and  the  lasso 
fit  within  this  class  of  approaches  and  are  considered  in  detail  in  Sects.  10.5.1 
and  10.5.2,  respectively.  Such  methods  are  often  used  in  situations  in  which  the 
data  are  sparse  (in  the  sense  of  k  being  large  relative  to  n). 


4.9  Model  Building  Uncertainty 

If  a  single  model  is  selected  on  the  basis  of  a  stepwise  method  or  via  a  search  over 
all  models,  then  bias  will  typically  result.  Interval  estimates,  whether  they  be  based 
on  Bayesian  or  frequentist  approaches,  will  tend  to  be  too  narrow  since  they  are 
produced  by  conditioning  on  the  final  model  and  hence  do  not  reflect  the  mechanism 
by  which  the  model  was  selected;  see  Chatfield  (1995)  and  the  accompanying 
discussion. 

To  be  more  explicit,  let  P  denote  the  procedure  by  which  a  final  model  M  is 
selected,  and  suppose  it  is  of  interest  to  examine  the  properties  of  an  estimator 
$\hat{\phi}$ of a univariate parameter $\phi$, for example, a regression coefficient associated
with  a  covariate  of  interest.  The  usual  frequentist  unbiasedness  results  concern 
the  expectation  of  an  estimator  within  a  fixed  model.  We  saw  an  example  of  bias 
following  variable  selection,  with  the  bias  given  by  (4.23).  In  general,  the  estimator 
obtained  from  a  selection  procedure  will  not  be  unbiased  with  respect  to  the  final 
model  chosen,  that  is, 


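$$E[\hat{\phi} \mid P] = E_{M \mid P}\left[ E(\hat{\phi} \mid M) \right] \tag{4.27}$$
$$\neq E(\hat{\phi} \mid M), \tag{4.28}$$

where $M$ is the final model chosen. In addition,

$$\mathrm{var}(\hat{\phi} \mid P) = E_{M \mid P}\left[ \mathrm{var}(\hat{\phi} \mid M) \right] + \mathrm{var}_{M \mid P}\left( E[\hat{\phi} \mid M] \right) \tag{4.29}$$
$$\neq \mathrm{var}(\hat{\phi} \mid M) \tag{4.30}$$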

where  the  latter  approximates  the  first  term  of  (4.29)  only.  Hence,  in  general,  the 
reported  variance  conditional  on  a  chosen  model  will  be  an  underestimate.  The  bias 
and  variance  problems  arise  because  the  procedure  by  which  M  was  chosen  is  not 
being  acknowledged. 
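The decomposition in (4.29) can be illustrated by simulation. The sketch below is an assumption-laden toy example, not the book's own simulation: two correlated covariates, a selection procedure that keeps the second covariate only if its t statistic exceeds 1.96 in absolute value, and hypothetical parameter values. It records the estimate of the first coefficient reported by the selected model and splits its variance over repeated sampling into the within-model and between-model terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000
beta1, beta2, rho = 1.0, 0.3, 0.7  # hypothetical values, chosen for illustration

est = np.empty(reps)               # estimate of beta1 reported after selection
chosen = np.empty(reps, dtype=int)  # 1 = full model kept, 0 = x2 dropped

for r in range(reps):
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

    # Full model (no intercept, since the simulated intercept is zero).
    X = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)
    se_b2 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

    if abs(b[1] / se_b2) > 1.96:
        chosen[r], est[r] = 1, b[0]
    else:
        chosen[r], est[r] = 0, (x1 @ y) / (x1 @ x1)

# Law of total variance over the selection procedure, as in (4.29).
total_var = est.var()
within = 0.0
between = 0.0
for m in np.unique(chosen):
    mask = chosen == m
    weight = mask.mean()
    within += weight * est[mask].var()
    between += weight * (est[mask].mean() - est.mean()) ** 2

print(f"total {total_var:.4f} = within {within:.4f} + between {between:.4f}")
```

The between-model term is the part of the variability that is ignored when a variance is reported conditional on the selected model alone.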

From  a  Bayesian  standpoint,  the  same  problem  exists  because  the  posterior 
distribution  should  reflect  all  sources  of  uncertainty  and  a  priori  all  possible  models 
that  may  be  entertained  should  be  explicitly  stated,  with  prior  distributions  being 
placed  upon  different  models  and  the  parameters  of  these  models.  Model  averaging 
should  then  be  carried  out  across  the  different  possibilities,  a  process  which  is 
fraught  with  difficulties  not  least  in  placing  “comparable”  priors  over  what  may 
be  fundamentally  different  objects  (see  Sect.  6.16.3  for  an  approach  to  rectifying 
this problem). Suppose there are $m$ potential models and that $p_j = \Pr(M_j \mid y)$ is
the posterior probability of model $j$, $j = 1, \ldots, m$. Then

$$E[\phi \mid y] = \sum_{j=1}^{m} E[\phi \mid M_j, y] \times p_j$$
$$\neq E[\phi \mid M, y], \tag{4.31}$$

where the latter is that which would be reported, based on a single model $M$. The
"bias" is $E[\phi \mid M, y] - E[\phi \mid y]$. In addition,

$$\mathrm{var}(\phi \mid y) = \sum_{j=1}^{m} \mathrm{var}(\phi \mid M_j, y) \times p_j + \sum_{j=1}^{m} \left( E[\phi \mid M_j, y] - E[\phi \mid y] \right)^2 \times p_j \tag{4.32}$$
$$\neq \mathrm{var}(\phi \mid M, y), \tag{4.33}$$

so  that  the  variance  in  the  posterior  acknowledges  both  the  weighted  average  of  the 
within-model  variances,  via  the  first  term  in  (4.32),  and  the  weighted  contributions 
to  the  between-model  variability,  via  the  second  term.  Note  the  analogies  between 
the  frequentist  and  Bayesian  biases,  (4.28)  and  (4.31),  and  the  reported  variances, 
(4.30)  and  (4.33). 
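The model-averaged mean and variance in (4.31) and (4.32) are simple weighted sums, which the following sketch makes concrete. The function name `bma_summary` and the three sets of per-model summaries are hypothetical; in practice the inputs would come from fitting each candidate model.

```python
import numpy as np


def bma_summary(means, variances, probs):
    """Combine per-model posterior means and variances using posterior model probabilities.

    Returns the averaged mean, the total variance, and its within-model and
    between-model components, following (4.31) and (4.32).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()  # guard against weights that do not sum to one

    mean = np.sum(probs * means)
    within = np.sum(probs * variances)
    between = np.sum(probs * (means - mean) ** 2)
    return mean, within + between, within, between


# Hypothetical posterior summaries of one coefficient under three models.
model_means = [0.55, 0.60, 0.50]
model_vars = [0.075**2, 0.080**2, 0.070**2]
model_probs = [0.6, 0.3, 0.1]

bma_mean, bma_var, within_var, between_var = bma_summary(
    model_means, model_vars, model_probs
)
```

Because the between-model term is non-negative, the averaged variance is never smaller than the probability-weighted average of the within-model variances, although it can be smaller or larger than the variance under any one particular model.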

The  fundamental  message  here  is  that  carrying  out  model  selection  leads  to 
estimators  whose  frequency  properties  are  not  those  of  the  estimators  without  any 
tests  being  performed  (Miller  1990;  Breiman  and  Spector  1992)  and  Bayesian 
single model summaries are similarly misleading. This problem is not unique to
variable selection. Similar problems occur when other forms of model refinement
are entertained, such as transformations of y and/or x, or experimenting with a
variety of variance models and error distributions.

Table 4.3 Parameter estimates, standard errors, and T statistics for the prostate cancer data. The
full model and models chosen by stepwise/BIC and Cp/AIC are reported

                     Full model                    Stepwise/BIC model            Cp/AIC model
   Variable          Est.     (Std. err.)  T stat. Est.     (Std. err.)  T stat. Est.     (Std. err.)  T stat.
1  log(can vol)      0.59     (0.088)      6.7     0.55     (0.075)      7.4     0.57     (0.075)      7.6
2  log(weight)       0.46     (0.17)       2.7     0.51     (0.15)       3.9     0.42     (0.17)       2.5
3  age               -0.020   (0.011)      -1.8    -        -            -       -0.015   (0.011)      -1.4
4  log(BPH)          0.11     (0.058)      1.8     -        -            -       0.11     (0.058)      1.9
5  SVI               0.77     (0.24)       3.1     0.67     (0.21)       3.2     0.72     (0.21)       3.5
6  log(cap pen)      -0.11    (0.091)      -1.2    -        -            -       -        -            -
7  gleason           0.045    (0.16)       0.29    -        -            -       -        -            -
8  PGS45             0.0045   (0.0044)     1.0     -        -            -       -        -            -
   $\hat{\sigma}$    0.78     -            -       0.72     -            -       0.71     -            -


Example:  Prostate  Cancer 

We  begin  by  fitting  the  full  model  containing  all  eight  variables.  Table  4.3  gives  the 
coefficients,  standard  errors,  and  T  statistics.  For  this  example,  the  forward  selection 
and  backward  elimination  stepwise  procedures  all  lead  to  the  same  model  containing 
the  three  variables  log(can  vol),  log(weight),  and  SVI.  The  p-value  thresholds  were 
chosen  to  be  0.05.  The  standard  errors  associated  with  the  significant  variables  all 
decrease  for  the  reduced  model  when  compared  to  the  full  model.  This  behavior 
reflects  the  bias-variance  trade-off  whereby  a  reduced  model  may  have  increased 
precision  because  of  the  fewer  competing  explanations  for  the  data  (for  more 
discussion,  see  Sect.  5.9).  We  emphasize,  however,  that  uncertainty  in  the  model 
search  is  not  acknowledged  in  the  estimates  of  standard  error.  We  see  that  the 
estimated  standard  deviation  is  also  smaller  in  the  reduced  model. 

Turning now to methods that evaluate all subsets, Fig. 4.12 plots the $C_p$
statistic versus the number of parameters in the model. For clarity, we do not include
models with fewer than four parameters in the plot, since these were not competitive.
Recall that we would like models with a small number of parameters whose $C_p$
value is close to or less than the line of equality. The variable plotting labels are
given in Table 4.3. For these data, we pick out the model with variables labeled 1,
2, 3, 4, and 5 since this corresponds to a model that is close to the line in Fig. 4.12
and has relatively few parameters. The five variables are log(can vol), log(weight),
age, log(BPH), and SVI, so that age and log(BPH) are added to the stepwise model.

Carrying out an exhaustive search over all main effects models, using the adjusted
$R^2$ to pick the best model (which recall is equivalent to picking that model with
the smallest $\hat{\sigma}^2$), gives a model with seven variables (gleason is the variable not
included). The estimate of the error standard deviation is $\hat{\sigma} = 0.70$. The minimum BIC model
was the same model as picked by the stepwise procedures.

Fig. 4.12 Mallows' $C_p$ statistic plotted versus $p$, where $p - 1$ is the number of
covariates in the model, for the prostate cancer data. The line of equality is
indicated; for a good model $E[C_p] \approx p$, where the expectation is over repeated
sampling. The variable labels are given in Table 4.3

We  used  Bayesian  model  averaging  with,  for  illustration,  equal  weights  on  each 
of the $2^8$ models and weakly informative priors. The most probable model has
posterior  probability  0.20  and  contains  log(can  vol),  log(weight),  and  SVI,  while 
the  second  replaces  log(weight)  with  log(BPH)  and  has  posterior  probability  0.09. 
The  third  most  probable  model  adds  log(BPH)  to  the  most  probable  model  and 
has  probability  0.037.  Cumulatively  across  models,  the  posterior  probability  that 
log(can  vol)  is  in  the  model  is  close  to  1,  with  the  equivalent  posterior  probabilities 
for  SVI,  log(weight),  and  log(BPH)  being  0.69, 0.66,  and  0.27,  respectively.  A  more 
detailed  practical  examination  of  BMA  is  presented  at  the  end  of  Chap.  10. 
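The mechanics of averaging over all subsets can be sketched on synthetic data. The example below does not reproduce the prostate analysis: it simulates four covariates with hypothetical coefficients, places equal prior weight on each of the $2^4$ main effects models, and, as a stand-in for the weakly informative priors used above, approximates posterior model probabilities by normalizing $\exp(-\mathrm{BIC}/2)$. Posterior inclusion probabilities are then sums over the models containing each covariate.

```python
import itertools

import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 4
X = rng.standard_normal((n, k))
true_beta = np.array([1.0, 0.0, 0.3, 0.0])  # hypothetical truth for the simulation
y = 0.5 + X @ true_beta + rng.standard_normal(n)


def gaussian_bic(y, X_sub):
    """BIC of a Gaussian linear model with intercept, fitted by least squares."""
    n_obs = len(y)
    design = np.column_stack([np.ones(n_obs), X_sub])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    loglik = -0.5 * n_obs * (np.log(2 * np.pi * rss / n_obs) + 1.0)
    n_params = design.shape[1] + 1  # regression coefficients plus the error variance
    return -2.0 * loglik + n_params * np.log(n_obs)


# Enumerate every subset of the k covariates (2^k models in total).
models = [
    subset
    for size in range(k + 1)
    for subset in itertools.combinations(range(k), size)
]
bics = np.array([gaussian_bic(y, X[:, np.array(m, dtype=int)]) for m in models])

# Equal prior model weights; BIC-based approximation to posterior model probabilities.
weights = np.exp(-0.5 * (bics - bics.min()))
post_prob = weights / weights.sum()

# Posterior inclusion probability of covariate j: sum over models that contain j.
inclusion = np.array(
    [sum(p for m, p in zip(models, post_prob) if j in m) for j in range(k)]
)
best_model = models[int(np.argmax(post_prob))]
```

Subtracting the minimum BIC before exponentiating avoids numerical underflow and leaves the normalized probabilities unchanged.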


4.10  A  Pragmatic  Compromise  to  Variable  Selection 

One solution to the problem of deciding which variables to include in a regression model
is to never refine the model for a given dataset. This approach is philosophically
pure  but  pragmatically  dubious  (unless  one  is  in  the  context  of,  say,  a  randomized 
experiment)  since  we  may  obtain  appropriate  inference  for  a  model  that  is  a  very 
poor  description  of  the  phenomenon  under  study.  It  is  hard  to  state  general  strategies, 
but  on  some  occasions,  it  may  be  safest,  and  the  most  informative,  to  report  multiple 
models. 

We  consider  situations  that  are  not  completely  confirmatory  and  not  completely 
exploratory. Rather we would like to obtain a good description of the phenomena
under study and also have some faith in reported interval estimates. The philosophy
suggested  here  is  to  think  as  carefully  as  possible  about  the  model  before  the 
analysis  proceeds.  In  particular,  context-specific  models  should  be  initially  posited. 
Hopefully  the  initial  model  provides  a  good  description,  but  after  fitting  the  model, 
model  checking  should  be  carried  out  and  the  model  may  be  refined  in  the  face  of 
clear  model  inadequacy,  with  refinement  ideally  being  carried  out  within  distinct  a 
priori  known  classes.  A  key  requirement  is  to  describe  the  procedure  followed  when 
the  results  are  reported. 

If  a  model  is  chosen  because  it  is  clearly  superior  to  the  alternatives  then,  roughly 
speaking,  inference  may  proceed  as  if  the  final  model  were  the  one  that  was  chosen 
initially.  This  is  clearly  a  subjective  procedure  but  can  be  informally  justified  via 
either  frequentist  or  Bayesian  approaches.  From  a  frequentist  viewpoint,  it  may  be 
practically reasonable to assume, with respect to (4.28), that $E[\hat{\phi} \mid P] \approx E[\hat{\phi} \mid M]$
because $M$ would be almost always chosen in repeated sampling under these
circumstances. In a similar vein, under a Bayesian approach, the above procedure
is consistent with model averaging in which the posterior model weight on the
chosen model is close to 1 (since alternative models are only rejected on the basis
of clear inadequacy), that is, with reference to (4.31), $E[\phi \mid y] \approx E[\phi \mid M, y]$,
because $\Pr(M \mid y) \approx 1$. The aim is to provide probability statements, from either
philosophical standpoint, that are "honest" representations of uncertainty.

The  same  heuristic  applies  more  broadly  to  examination  of  model  choice,  beyond 
which  variables  to  put  in  the  mean  model.  As  an  example  of  when  the  above 
procedure  should  not  be  applied,  examining  quantile-quantile  plots  of  residuals  for 
different  Student’s  t  distributions  and  picking  the  one  that  produces  the  straightest 
line  would  not  be  a  good  idea. 


4.11  Concluding  Comments 

In  this  chapter,  we  have  discussed  frequentist  and  Bayesian  approaches  to  hypothesis 
testing. With respect to variable selection, we make the following tentative conclusions.
For pure confirmatory studies, one should not carry out model selection and
should instead use background context to specify the model. Prediction is a totally different
enterprise and is the subject of Chaps. 10-12. In exploratory studies, stepwise and
all subsets methods may point to important models, but attaching (frequentist or Bayesian)
probabilistic statements to interval estimates is difficult. For studies somewhere
between purely confirmatory and exploratory, one should attempt to minimize model
selection, as described in Sect. 4.10.

From  a  Bayesian  or  a  frequentist  perspective,  regardless  of  the  criteria  used  in 
a  multiple  hypothesis  testing  situation,  it  is  essential  to  report  the  exact  procedure 
followed,  to  allow  critical  interpretation  of  the  results. 

We have seen that when a point null, such as $H_0: \theta = 0$, is tested, then frequentist
and Bayesian procedures may well differ considerably in their conclusions. This
is in contrast to the testing of a one-sided null such as $H_0: \theta \le 0$; see Casella
and Berger (1987) for discussion. We conclude that hypothesis testing is difficult
regardless of the frequentist or Bayesian persuasion of the analysis. A particular
difficulty is how to calibrate the decision rule; many would agree that the Bayesian
approach is the most natural since it directly estimates $\Pr(H_0 \mid y)$, but this
estimate depends on the choices for the alternative hypotheses (so is a relative
rather than an absolute measure) and on all of the prior specifications. The practical
interpretation of the p-value depends crucially on the power (sample size and
observed covariate distribution in a regression setting) and reporting point and
interval estimates alongside a p-value or an $\alpha$ level is strongly recommended.

Model  choice  is  a  fundamentally  more  difficult  endeavor  than  estimation  since 
we  rarely,  if  ever,  specify  an  exactly  true  model.  In  contrast,  estimation  is  concerned 
with parameters (such as averages or linear associations with respect to a population)
and these quantities are well defined (even if the models within which they are
embedded  are  mere  approximations). 


4.12  Bibliographic  Notes 

There  is  a  vast  literature  contrasting  Bayesian  and  frequentist  approaches  to 
hypothesis  testing,  and  we  mention  just  a  few  references.  Berger  (2003)  summarizes 
and contrasts the Fisherian (p-values), Neyman ($\alpha$ levels), and Jeffreys (Bayes
factors)  approaches  to  hypothesis  testing,  and  Goodman  (1993)  provides  a  very 
readable,  nontechnical  commentary.  Loss  functions  more  complex  than  those 
considered  in  Sect.  4.3  are  discussed  in,  for  example,  Inoue  and  Parmigiani  (2009). 

The  running  multiple  hypothesis  testing  example  concerned  the  analysis  of 
multiple  transcripts  from  a  microarray  experiment.  The  analysis  of  such  data  has 
received  a  huge  amount  of  attention;  see,  for  example,  Kerr  (2009)  and  Efron  (2008). 


4.13  Exercises 

4.1 Consider the simple situation in which $Y_i \mid \theta \overset{iid}{\sim} N(\theta, \sigma^2)$ with $\sigma^2$ known.
The MLE $\hat{\theta} = \bar{Y} \sim N(\theta, V)$ with $V = \sigma^2/n$. The null and alternative
hypotheses are $H_0: \theta = 0$ and $H_1: \theta \neq 0$, and under the alternative, assume
$\theta \sim N(0, W)$. Consider the case $W = \sigma^2$:

(a) Derive the Bayes factor for this situation.

(b) Suppose that the prior odds are $\mathrm{PO} = \pi_0/(1 - \pi_0)$, with $\pi_0$ the prior on
the null, and let $R = L_{II}/L_{I}$ be the ratio of losses of type II to type I errors.
Show that this setup leads to a decision rule to reject $H_0$ of the form


$$\sqrt{1+n} \times \exp\left( -\frac{Z^2}{2}\,\frac{n}{1+n} \right) \times \mathrm{PO} < R, \tag{4.34}$$


where $Z = \hat{\theta}/\sqrt{V}$ is the usual Z-statistic.

(c) Rearrangement of (4.34) gives a Wald statistic threshold of

$$Z^2 > \frac{2(1+n)}{n} \log\left( \frac{\mathrm{PO}}{R} \sqrt{1+n} \right).$$

Form a table of the p-values corresponding to this threshold, as a function
of $\pi_0$ and $n$ and with $R = 1$. Hence, comment on the use of 0.05 as a
threshold.

4.2 The $k$-FWER criterion controls the probability of rejecting $k$ or more true null
hypotheses, with $k = 1$ giving the usual FWER criterion. Show that the
procedure that rejects only the null hypotheses $H_i$, $i = 1, \ldots, m$, for those
p-values with $p_i < k\alpha/m$, controls the $k$-FWER at level $\alpha$.

4.3  Prove  expression  (4.17). 

4.4  In  this  question,  an  MCMC  algorithm  for  the  Bayesian  mixture  model  described 
in  Sect.  4.6.2  will  be  derived  and  applied  to  “pseudo”  gene  expression  data  that 
is  available  on  the  book  website. 

The  three-stage  model  is: 

Stage  One: 


Stage Two: $H_i \mid \pi_1 \overset{iid}{\sim} \mathrm{Bernoulli}(\pi_1)$.

Stage Three: Independent priors on the common parameters:

$$p(\delta, \tau, \pi_0) \propto 1/\tau.$$

Derive the form of the conditional distributions

$$\tau^2 \mid \delta, \pi_0, H$$
$$\pi_0 \mid \tau^2, \delta, H$$
$$H_i \mid \delta, \tau^2, \pi_0, H_{-i}, \quad i = 1, \ldots, m,$$

where $H = [H_1, \ldots, H_m]$. The form for $\tau^2$ requires a Metropolis-Hastings
step (as described in Sect. 3.8.2).


Implement  this  algorithm  for  the  gene  expression  data  on  the  book  website. 


Part  II 
Independent  Data 


Chapter  5 

Linear  Models 


5.1  Introduction 

In  this  chapter  we  consider  linear  regression  models.  These  models  have  received 
considerable attention because of their mathematical and computational convenience
and the relative ease of parameter interpretation. We discuss a number of
issues  that  require  consideration  in  order  to  perform  a  successful  linear  regression 
analysis.  These  issues  are  relevant  irrespective  of  the  inferential  paradigm  adopted 
and  so  apply  to  both  frequentist  and  Bayesian  analyses. 

The  structure  of  this  chapter  is  as  follows.  We  begin  in  Sect.  5.2  by  describing 
a  motivating  example,  before  laying  out  the  linear  model  specification  in  Sect.  5.3. 
A  justification  for  linear  modeling  is  provided  in  Sect.  5.4.  In  Sect.  5.5,  we  discuss 
parameter  interpretation,  and  in  Sects.  5.6  and  5.7,  we  describe,  respectively, 
frequentist  and  Bayesian  approaches  to  inference.  In  Sect.  5.8,  the  analysis  of 
variance  is  briefly  discussed.  Section  5.9  provides  a  discussion  of  the  bias-variance 
trade-off  that  is  encountered  when  one  considers  which  covariates  to  include  in  the 
mean model. In Sect. 5.10, we examine the robustness of the least squares estimator
to  model  assumptions;  this  estimator  can  be  motivated  from  estimating  function, 
likelihood,  and  Bayesian  perspectives.  The  assessment  of  assumptions  is  considered 
in  Sect.  5.11.  Section  5.12  returns  to  the  motivating  example.  Concluding  remarks 
are  provided  in  Sect.  5.13  with  references  to  additional  material  in  Sect.  5.14. 


5.2  Motivating  Example:  Prostate  Cancer 

Throughout  this  chapter  we  use  the  prostate  cancer  data  of  Sect.  1.3.1  to  illustrate 
the  main  points.  These  data  consist  of  nine  measurements  taken  on  97  men. 
Along  with  the  response,  the  log  of  prostate-specific  antigen  (PSA),  there  are  eight 
covariates.  As  an  illustrative  inferential  question,  we  consider  estimation  of  the 
linear association between log(PSA) and the log of cancer volume, with possible
adjustment for other "important" variables. Figure 5.1 plots log(PSA) versus log
cancer volume, along with a smoother. The relationship looks linear, but Figs. 1.1
and 1.2 showed that log(PSA) was also associated with a number of the additional
seven covariates and that there are strong associations between the eight covariates
themselves. Consequently, we might question whether some or all of the other seven
variables should be added to the model.

Fig. 5.1 Log of prostate-specific antigen versus log cancer volume, with smoother superimposed


5.3  Model  Specification 

A  multiple  linear  regression  model  takes  the  form 


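$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i, \tag{5.1}$$

where we begin by assuming that the error terms are uncorrelated with $E[\epsilon_i] = 0$ and
$\mathrm{var}(\epsilon_i) = \sigma^2$. In a simple linear regression model, $k = 1$ so that we have a single
covariate. Linearity here is with respect to the parameters, and so variables may
undergo nonlinear transforms from their original scale, before inclusion in (5.1).

In matrix form we write

$$Y = x\beta + \epsilon, \tag{5.2}$$

where

$$
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad
x = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix},
$$

with $E[\epsilon] = 0$ and $\mathrm{var}(\epsilon) = \sigma^2 I_n$. We will also sometimes write

$$Y_i = x_i \beta + \epsilon_i,$$

where $x_i = [1 \;\; x_{i1} \;\; \ldots \;\; x_{ik}]$ for $i = 1, \ldots, n$.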
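The correspondence between the scalar form (5.1) and the matrix form (5.2) can be checked numerically. The sketch below builds the $n \times (k+1)$ matrix $x$ with a leading column of ones from simulated covariates; the dimensions, coefficient values, and error standard deviation are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 97, 2                        # hypothetical sizes
covariates = rng.standard_normal((n, k))

# Design matrix x: a column of ones for the intercept, then the k covariates.
x = np.column_stack([np.ones(n), covariates])

beta = np.array([2.0, 0.5, -0.3])   # hypothetical [beta_0, beta_1, beta_2]
sigma = 0.7                         # hypothetical error standard deviation
eps = sigma * rng.standard_normal(n)

# Matrix form (5.2): all n responses at once.
Y = x @ beta + eps

# Row form: the i-th response uses the i-th row x_i of the design matrix.
i = 0
Y_i = x[i] @ beta + eps[i]

# Least squares estimate of beta from the simulated data.
beta_hat, *_ = np.linalg.lstsq(x, Y, rcond=None)
```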

The  covariates  may  be  continuous  or  discrete.  Discrete  variables  with  a  finite 
set  of  values  are  known  as  factors,  with  the  values  being  referred  to  as  levels.  The 
levels may be ordered, and the ordering may or may not be based upon numerical values.
For  example,  dose  levels  of  a  drug  are  associated  with  numerical  values  but  may  be 
viewed  as  factor  levels.  Suppose  x  represents  dose,  with  levels  0,  1,  and  5.  There  are 
two  alternative  models  that  are  immediately  suggested  for  such  an  x  variable.  First, 
we  may  use  a  simple  linear  model  in  x: 


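$$E[Y \mid x] = \beta_0 + \beta_1 x. \tag{5.3}$$

Second, we may adopt the model

$$E[Y \mid x] = \alpha_0 \times I(x = 0) + \alpha_1 \times I(x = 1) + \alpha_2 \times I(x = 5), \tag{5.4}$$

where the indicator function $I(\cdot)$ equals 1 if its argument is true and 0 otherwise,
and ensures that the appropriate level of $x$ is picked. The mean function (5.4)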
allows  for  nonlinearity  in  the  modeled  association  between  Y  and  the  observed  x 
values,  but  does  not  allow  interpolation  to  unobserved  values  of  x.  In  contrast,  (5.3) 
allows  interpolation  but  imposes  linearity.  For  an  ordinal  variable,  the  order  of 
categories  matters,  but  there  are  not  specific  values  associated  with  each  level 
(though  values  will  be  assigned  as  labels  for  computation).  An  example  of  an  ordinal 
value  is  a  pain  score  with  categories  none/mild/medium/severe.  Alternatively,  the 
levels  may  be  nominal  (such  as  female/male).  The  coding  of  factors  is  discussed  in 
Sect. 5.5.2. Covariates may be of inherent or specific interest, or may be included
in  the  model  in  order  to  control  for  sources  of  variability  or,  more  specifically, 
confounding;  Sect.  5.9  provides  more  discussion. 
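The two codings of dose in (5.3) and (5.4) correspond to two different design matrices. The following sketch uses nine made-up responses at doses 0, 1, and 5 to show that the indicator coding reproduces the observed mean at each level, whereas the linear coding is constrained to a straight line through the dose values.

```python
import numpy as np

# Hypothetical data: three observations at each dose level.
dose = np.array([0, 0, 1, 1, 5, 5, 0, 1, 5], dtype=float)
y = np.array([1.0, 1.2, 2.1, 1.9, 2.4, 2.6, 0.8, 2.0, 2.5])
levels = np.array([0.0, 1.0, 5.0])

# Model (5.3): intercept and a linear term in dose.
X_linear = np.column_stack([np.ones_like(dose), dose])
beta_hat, *_ = np.linalg.lstsq(X_linear, y, rcond=None)

# Model (5.4): one indicator column per level, I(x = 0), I(x = 1), I(x = 5).
X_factor = (dose[:, None] == levels[None, :]).astype(float)
alpha_hat, *_ = np.linalg.lstsq(X_factor, y, rcond=None)

group_means = np.array([y[dose == level].mean() for level in levels])
linear_fit_at_levels = beta_hat[0] + beta_hat[1] * levels
```

Only the linear model can produce a fitted mean at an unobserved dose such as 3; the factor model has no parameter for it.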

The  lower-/uppercase  notation  adopted  here  explicitly  emphasizes  that  the 
covariates  x  are  viewed  as  fixed  while  the  responses  Y  are  random  variables. 
This  is  true  regardless  of  whether  the  covariates  were  fixed  by  design  or  were 
random  with  respect  to  the  sampling  scheme.  In  the  latter  case  it  is  assumed  that 
the distribution of $x$ does not carry information concerning $\beta$ or $\sigma^2$, so that it is
ancillary (Appendix F). Specifically, letting $\gamma$ denote parameters associated with a
model for $x$, we assume that

$$p(y, x \mid \beta, \sigma^2, \gamma) = p(y \mid x, \beta, \sigma^2) \times p(x \mid \gamma), \tag{5.5}$$

so that conditioning on $x$ does not incur a loss in information with respect to $\beta$.
Hence, we can ignore the second term on the right-hand side of (5.5).

Random covariates, as just discussed, should be distinguished from inaccurately
measured covariates. We will assume throughout that the $x$ values are measured
without error, an assumption that must always be critically assessed. In an observational
setting in particular, it is common for elements of $x$ to be measured with
at least some error, but, informally speaking, we hope that these errors are small
relative to the ranges; if this is not the case, then we must consider so-called
errors-in-variables models; methods for addressing this problem are extensively discussed
in Carroll et al. (2006).


5.4  A  Justification  for  Linear  Modeling 

In  this  section  we  discuss  the  assumption  of  linearity.  In  general,  there  is  no  reason 
to expect the effects of continuous covariates to be causally linear,¹ but if we have
a "true" model, $E[Y \mid x] = f(x)$, then a first-order Taylor series expansion about a
point $x_0$ gives

$$f(x) \approx f(x_0) + \left. \frac{df}{dx} \right|_{x_0} (x - x_0) = \beta_0 + \beta_1 (x - x_0),$$

so that, at least for $x$ values close to $x_0$, we have an approximately linear
relationship.

As an example, Fig. 5.2 shows the height of 50 children plotted against their
age. The true nonlinear form from which these data were generated is the so-called
Jenss curve:

$$E[Y \mid x] = \beta_0 + \beta_1 x - \exp(\beta_2 + \beta_3 x),$$

where  Y  is  the  height  of  the  child  at  year  x.  This  model  was  studied  by  Jenss 
and  Bayley  (1937),  and  the  parameter  values  for  the  simulation  were  taken  from 
Dwyer  et  al.  (1983).  The  solid  line  on  Fig.  5.2  is  the  curve  from  which  these  data 
were  simulated,  and  the  dotted  and  dashed  lines  are  the  least  squares  fits  using  data 
from  ages  less  than  1.5  years  only  and  greater  than  4.5  years  only,  respectively. 
At  younger  ages,  the  association  is  approximately  linear,  and  similarly  for  older 
ages,  but  a  single  linear  curve  does  not  provide  a  good  description  over  the  complete 
age  range. 
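The local nature of the linear approximation can be seen directly from the Taylor expansion. The sketch below evaluates a Jenss curve and its tangent line at age 1; the parameter values are hypothetical (they are not the Dwyer et al. values used for Fig. 5.2), and a tangent line is used in place of the least squares fits shown in the figure.

```python
import numpy as np

# Hypothetical Jenss curve parameters (height in cm, age in years).
b0, b1, b2, b3 = 80.0, 6.0, 3.2, -1.0


def jenss(x):
    """Jenss curve: linear growth minus a decaying exponential term."""
    return b0 + b1 * x - np.exp(b2 + b3 * x)


def jenss_derivative(x):
    """First derivative of the Jenss curve with respect to age."""
    return b1 - b3 * np.exp(b2 + b3 * x)


def tangent_line(x, x0):
    """First-order Taylor approximation of the Jenss curve about x0."""
    return jenss(x0) + jenss_derivative(x0) * (x - x0)


x0 = 1.0
ages_near = np.linspace(0.75, 1.25, 51)   # close to the expansion point
ages_all = np.linspace(0.0, 6.0, 121)     # the full age range

max_error_near = np.max(np.abs(jenss(ages_near) - tangent_line(ages_near, x0)))
max_error_all = np.max(np.abs(jenss(ages_all) - tangent_line(ages_all, x0)))
```

With these values the tangent line tracks the curve to within a fraction of a centimetre near age 1 but is off by tens of centimetres by age 6, mirroring the behavior of the fitted lines described above.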


¹ In fact, as illustrated in Example 1.3.4, many physical phenomena are driven by differential
equations with nonlinear models arising as solutions to these equations.

Fig. 5.2 Illustration of linear approximations to a nonlinear growth curve model


5.5  Parameter  Interpretation 

Before  considering  inference,  we  discuss  parameter  interpretation  for  the  linear 
model.  This  topic  is  of  vital  importance  in  many  settings,  in  order  to  report  analyses 
in  a  meaningful  manner.  Interpretation  is  of  far  less  concern  in  situations  in  which 
we  simply  wish  to  carry  out  prediction;  methods  for  this  endeavor  are  described 
in  Chaps.  10-12.  In  a  Bayesian  analysis  the  specification  of  informative  prior 
distributions  requires  a  clear  understanding  of  the  meaning  of  parameters. 


5.5.1  Causation  Versus  Association 

We  begin  with  the  simple  linear  regression  model 

$$E[Y \mid x] = \beta_0 + \beta_1 x. \tag{5.6}$$

Here we have explicitly conditioned upon $x$ which is an important distinction since,
for example, the models

$$E[Y] = E[E(Y \mid x)] = \beta_0, \tag{5.7}$$

and

$$E[Y \mid x] = \beta_0, \tag{5.8}$$

are  very  different.  In  (5.7)  no  assumptions  are  made,  and  we  are  simply  saying 
that  there  is  an  average  response  in  the  population.  However,  (5.8)  states  that 
the expected response does not vary with $x$, which is a very strong assumption.
Consequently, care should be taken to understand which situation is being
considered.

We first consider the intercept parameter $\beta_0$ in (5.6), which is the expected
response at $x = 0$. The latter expectation may make little sense (e.g., suppose the
response is blood pressure and the covariate is weight), and there are a number of
reasons to instead use the model

$$E[Y \mid x] = \beta_0^{\dagger} + \beta_1 (x - x^*), \tag{5.9}$$

within which $\beta_0^{\dagger}$ is the expected response at $x = x^*$. By choosing $x^*$ to be a
meaningful value, we will, for example, be able to specify a prior for $\beta_0^{\dagger}$ more easily
in a Bayesian analysis (see Sect. 3.4.2 for further discussion). Choosing $x^* = \bar{x}$
is  dataset  specific  (which  does  not  allow  simple  comparison  of  estimates  across 
studies)  but  provides  a  number  of  statistical  advantages.  Of  course,  models  (5.6) 
and  (5.9)  provide  identical  inference  since  they  are  simply  two  parameterizations  of 
the  same  model. 
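That (5.6) and (5.9) are two parameterizations of the same model is easy to confirm numerically. The sketch below fits both by least squares to simulated blood pressure and weight data (all values hypothetical) with the centering point taken to be the sample mean: the slope is unchanged, and the intercept of the centered model equals the fitted response at the centering point.

```python
import numpy as np

rng = np.random.default_rng(2)
weight = rng.normal(75.0, 10.0, size=100)               # hypothetical weights (kg)
bp = 90.0 + 0.5 * weight + rng.normal(0.0, 5.0, 100)    # hypothetical blood pressure


def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Uncentered model (5.6): the intercept is the expected response at weight 0.
intercept_raw, slope_raw = fit_line(weight, bp)

# Centered model (5.9) with the centering point set to the sample mean.
x_star = weight.mean()
intercept_centered, slope_centered = fit_line(weight - x_star, bp)
```

When the centering point is the sample mean, the centered intercept is also the sample mean of the response, which is one of the statistical conveniences of that choice.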

In both (5.6) and (5.9), the mathematical interpretation of the parameter $\beta_1$ is
that it represents the additive change in the expected response for a unit increase
in $x$. Notice that the interpretation of $\beta_1$ depends on the scales of measurement of
both $x$ and $Y$. More generally, $c\beta_1$ represents the additive change in the expected
response for a $c$ unit change in $x$. A difficulty with such interpretations is that it is
inviting to think that if we were to provide an intervention and, for example, increase
$x$ by one unit for every individual in a population, then the expected response would
change by $\beta_1$. The latter is a causal interpretation and is not appropriate in most
situations, and never in observational studies, because unmeasured variables that
are associated with both $Y$ and $x$ will be contributing to the observed association,
$\beta_1$, between $Y$ and $x$. In a designed experiment in which everything proceeds as
planned, $x$ is randomly assigned to each individual, and we may interpret $\beta_1$ as
the  expected  change  in  the  response  for  an  individual  following  an  intervention  in 
which  x  were  increased  by  one  unit.  Even  in  this  ideal  situation  we  need  to  know 
that  the  randomization  was  successfully  implemented.  It  is  also  preferable  to  have 
large  sample  sizes  so  that  any  chance  imbalance  in  variables  between  groups  (as 
defined  by  different  x  values)  is  small. 

We  illustrate  the  problems  with  a  simple  idealized  example.  Suppose  the  “true” 
model  is 


$$E[Y \mid x, z] = \beta_0 + \beta_1 x + \beta_2 z, \tag{5.10}$$

and let

$$E[Z \mid x] = a + bx, \tag{5.11}$$

describe the linear association between $x$ and $z$. Then, if $Y$ is regressed on $x$ only,

$$E[Y \mid x] = E_{Z \mid x}[E(Y \mid x, Z)]$$
$$= \beta_0 + \beta_1 x + \beta_2 E[Z \mid x]$$
$$= \beta_0^* + \beta_1^* x \tag{5.12}$$

where

$$\beta_0^* = \beta_0 + a\beta_2$$
$$\beta_1^* = \beta_1 + b\beta_2 \tag{5.13}$$

showing that, when we observe $x$ only and fit model (5.12), our estimate of $\beta_1^*$
reflects not just the effect of $x$ but, in addition, the effect of $z$ mediated through its
association with $x$. If $b = 0$, so that $X$ and $Z$ are uncorrelated, or if $\beta_2 = 0$, so that
$Z$ does not affect $Y$, then there will be no bias. Here "bias" refers to estimation of
$\beta_1$, and not to $\beta_1^*$. So for bias to occur in a linear model, $Z$ must be related to both $Y$
and  X  which,  roughly  speaking,  is  the  definition  of  a  confounder.  The  simulation  at 
the  end  of  Sect.  4.8  illustrated  this  phenomenon.  A  major  problem  in  observational 
studies  is  that  unmeasured  confounders  can  always  distort  the  true  association. 
This  argument  reveals  the  beauty  of  randomization  in  which,  by  construction,  there 
cannot  be  systematic  differences  between  groups  of  units  randomized  to  different 
x  levels. 
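
The algebra leading to (5.13) can be checked with a small simulation. The following is a minimal sketch, not taken from the text: the function names, the parameter values, and the choice of normal noise are all illustrative assumptions. It generates $z$ from (5.11) and $Y$ from (5.10), regresses $Y$ on $x$ alone, and compares the fitted slope with $\beta_1 + b\beta_2$.

```python
import random


def ols_slope(x, y):
    """Least squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


def simulate_confounded_slope(n=20000, beta1=1.0, beta2=2.0, a=0.5, b=0.8, seed=1):
    """Simulate (5.10)-(5.11) and return the slope from regressing Y on x only.

    All numerical values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = [a + b * xi + rng.gauss(0.0, 1.0) for xi in x]          # E[Z | x] = a + b x
    y = [3.0 + beta1 * xi + beta2 * zi + rng.gauss(0.0, 1.0)    # E[Y | x, z]
         for xi, zi in zip(x, z)]
    return ols_slope(x, y)


slope = simulate_confounded_slope()
# The slope should be near beta1 + b * beta2 = 2.6 rather than beta1 = 1.0.
print(round(slope, 2))
```

Setting `b=0.0` or `beta2=0.0` in the call should bring the fitted slope back close to `beta1`, in line with the two no-bias conditions stated above.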

To  rehearse  this  argument  in  a  particular  context,  suppose  Y  represents  the 
proportion  of  individuals  with  lung  cancer  in  a  population  of  individuals  with 
smoking  level  (e.g.,  pack  years)  x.  We  know  that  alcohol  consumption,  z,  is 
also  associated  with  lung  cancer,  but  it  is  unmeasured.  In  addition,  X  and  Z  are 
positively correlated. If we fit model (5.12), that is, regress $Y$ on $x$ only, then
the resultant $\beta_1^\star$ is reflecting not only the effect of smoking but that of alcohol
also through its association with smoking. Specifically, since $b > 0$ (individuals
who smoke are more likely to have increased alcohol consumption), and taking
$\beta_2 > 0$, (5.13) indicates that $\beta_1^\star$ will overestimate the true smoking effect $\beta_1$. If we were to
intervene in our study population and (somehow) decrease smoking levels by
one unit, then we would not expect the lung cancer incidence to decrease by $\beta_1^\star$
because alcohol consumption in the population has remained constant (assuming
the imposed reduction does not change alcohol patterns). Rather, from (5.10), the
expected decrease in the fraction with lung cancer will be $\beta_1$ if there were no
other confounders (which of course is not the case). The interpretation of $\beta_1^\star$ is
the following. If we were to examine two groups of individuals within the study
population with levels of smoking of $x + 1$ and $x$, then we would expect lung cancer
incidence to be $\beta_1^\star$ higher in the group with the higher level of smoking.

To  summarize,  great  care  must  be  taken  with  parameter  interpretation  in 
observational studies because we are estimating associations and not causal relationships.
The parameter estimate associated with $x$ reflects not only the “true” effect
of  x  but  also  the  effects  of  all  other  unmeasured  variables  that  are  related  to  both  x 
and  Y. 


5.5.2  Multiple  Parameters 

In  the  model 

$$E[Y \mid x_1, \ldots, x_k] = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k,$$

the parameter $\beta_j$ is the additive change in the average response associated with a
unit change in $x_j$, with all other variables held constant.

In some situations the parameters of a model may be very difficult to interpret.
Consider the quadratic model:

$$E[Y \mid x] = \beta_0 + \beta_1 x + \beta_2 x^2.$$

In this model, interpretation of $\beta_1$ (and $\beta_2$) is difficult because we cannot change
$x$ by one unit and simultaneously hold $x^2$ constant. An alternative parameterization
that is easier to interpret is $\gamma = [\gamma_0, \gamma_1, \gamma_2]$, where $\gamma_0 = \beta_0$, $\gamma_1 = -\beta_1/(2\beta_2)$, and
$\gamma_2 = \beta_0 - \beta_1^2/(4\beta_2)$. Here $\gamma_1$ is the $x$ value representing the turning point of the
quadratic, and $\gamma_2$ is the expected value of the curve at this point.
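
As a quick numerical check of the turning-point parameterization, the sketch below (function names and coefficient values are hypothetical, chosen only for illustration) computes $\gamma_1$ and $\gamma_2$ from $\beta_0, \beta_1, \beta_2$ and confirms that the quadratic evaluated at $x = \gamma_1$ equals $\gamma_2$.

```python
def turning_point(beta0, beta1, beta2):
    """Return (gamma1, gamma2): location and height of the quadratic's turning point."""
    gamma1 = -beta1 / (2.0 * beta2)
    gamma2 = beta0 - beta1 ** 2 / (4.0 * beta2)
    return gamma1, gamma2


def mean_response(x, beta0, beta1, beta2):
    """E[Y | x] under the quadratic model."""
    return beta0 + beta1 * x + beta2 * x ** 2


# Illustrative coefficients: E[Y | x] = 1 + 4x - 2x^2.
g1, g2 = turning_point(1.0, 4.0, -2.0)
print(g1, g2)                               # 1.0 3.0
print(mean_response(g1, 1.0, 4.0, -2.0))    # 3.0
```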

We  now  discuss  parameterizations  that  may  be  adopted  when  coding  factors. 
We  begin  with  a  simple  example  in  which  we  examine  the  association  between  a 
response $Y$ and a two-level factor $x_1$, which we refer to as gender, and code as
$x_1 = 0/1$, for female/male. The obvious formulation of the model is

$$E[Y \mid x_1] = \begin{cases} \beta_0' + \beta_1' & \text{if } x_1 = 0 \text{ (female)}, \\ \beta_0' + \beta_2' & \text{if } x_1 = 1 \text{ (male)}. \end{cases}$$

The parameters in this model are clearly not identifiable; the data may be summarized
as two means, but the model contains three parameters. This nonidentifiability
is sometimes referred to as (intrinsic) aliasing, and the solution is to place a
constraint on the parameters.

In the sum-to-zero parameterization, we impose the constraint $\beta_1' + \beta_2' = 0$, to
give the model

$$E[Y \mid x_1] = \begin{cases} \beta_0'' - \beta_1'' & \text{if } x_1 = 0 \text{ (female)}, \\ \beta_0'' + \beta_1'' & \text{if } x_1 = 1 \text{ (male)}. \end{cases}$$

In this case $E[Y \mid x] = x\boldsymbol{\beta}''$, where the rows of the design matrix are $x = [1, -1]$
if female and $x = [1, 1]$ if male. We write

$$\begin{aligned}
E[Y] &= E[Y \mid x_1 = 0] \times p_0 + E[Y \mid x_1 = 1] \times (1 - p_0) \\
&= \beta_0'' + \beta_1''(1 - 2p_0),
\end{aligned}$$

where $p_0$ is the proportion of females in the population. We therefore see that $\beta_0''$ is
the expected response if $p_0 = 1/2$, and

$$E[Y \mid x_1 = 1] - E[Y \mid x_1 = 0] = 2\beta_1'',$$

is the expected difference in responses between males and females.

An alternative parameterization imposes the corner-point constraint and assigns
$\beta_1' = 0$ so that

$$E[Y \mid x_1] = \begin{cases} \beta_0 & \text{if } x_1 = 0 \text{ (female)}, \\ \beta_0 + \beta_1 & \text{if } x_1 = 1 \text{ (male)}. \end{cases}$$

For this parameterization, $E[Y \mid x] = x\boldsymbol{\beta}$, where $x = [1, 0]$ if female and $x = [1, 1]$
if male. In this model, $\beta_0$ is the expected response for females, and $\beta_1$ is the additive
change in the expected response for males, as compared to females.

A final model is

$$E[Y \mid x_1] = \begin{cases} \beta_0^\dagger & \text{if } x_1 = 0 \text{ (female)}, \\ \beta_1^\dagger & \text{if } x_1 = 1 \text{ (male)}. \end{cases}$$

In this case $E[Y \mid x] = x\boldsymbol{\beta}^\dagger$, where $x = [1, 0]$ if female and $x = [0, 1]$ if male so that
$\beta_0^\dagger$ is the expected response for a female and $\beta_1^\dagger$ is the expected response for a male.
We stress that inference for each of the formulations is identical; all that changes is
parameter interpretation.
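
The claim that the three codings describe the same fit can be illustrated with a small sketch. The coding labels (`sum_to_zero`, `corner_point`, `cell_means`), the helper names, and the two group means below are hypothetical and used only for illustration; the design rows are those given in the text.

```python
# Design-matrix rows (female, male) for each coding described above.
DESIGN_ROWS = {
    "sum_to_zero": ([1, -1], [1, 1]),
    "corner_point": ([1, 0], [1, 1]),
    "cell_means": ([1, 0], [0, 1]),
}


def params_from_means(coding, mean_female, mean_male):
    """Parameters implied by the two group means under a named coding."""
    if coding == "sum_to_zero":
        return ((mean_female + mean_male) / 2.0, (mean_male - mean_female) / 2.0)
    if coding == "corner_point":
        return (mean_female, mean_male - mean_female)
    if coding == "cell_means":
        return (mean_female, mean_male)
    raise ValueError("unknown coding: " + coding)


def fitted_means(coding, params):
    """Design row times parameter vector, for female then male."""
    return tuple(sum(r * p for r, p in zip(row, params))
                 for row in DESIGN_ROWS[coding])


# Illustrative group means: 10 for females, 14 for males.
for coding in DESIGN_ROWS:
    params = params_from_means(coding, 10.0, 14.0)
    print(coding, params, fitted_means(coding, params))
```

The parameter values differ across codings, but the fitted means for the two groups are the same in every case, which is why only the interpretation changes.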

The  benefits  or  otherwise  of  alternative  parameterizations  should  be  considered 
in  the  light  of  their  extension  to  the  case  of  more  than  two  levels  and  to  multiple 
factors. For example, the $[\beta_0^\dagger, \beta_1^\dagger]$ parameterization does not generalize well to
a  situation  in  which  there  are  multiple  factors  and  we  do  not  wish  to  assume 
a  unique  mean  for  each  combination  of  factors  (i.e.,  a  non-saturated  model).  It 
is  obviously  important  to  determine  the  default  parameterization  adopted  in  any 
particular  statistical  package  so  that  parameter  interpretation  can  be  accurately 
carried  out. 

In  this  book  we  adopt  the  corner-point  parameterization.  Unlike  the  sum-to- 
zero  constraint,  this  parameterization  is  not  symmetric,  since  the  first  level  of 
each  factor  is  afforded  special  status,  but  parameter  interpretation  is  relatively 
straightforward.  If  possible,  one  should  define  the  factors  so  that  the  first  level  is 
the most natural “baseline.” We illustrate the use of this parameterization with an
example concerning two factors, $x_1$ and $x_2$, with $x_1$ having 3 levels, coded as 0, 1,
2, and $x_2$ having 4 levels coded as 0, 1, 2, 3. The coding for the no interaction (main
effects² only) model is

$$E[Y \mid x_1, x_2] = \begin{cases}
\mu & \text{if } x_1 = 0,\ x_2 = 0, \\
\mu + \alpha_j & \text{if } x_1 = j,\ j = 1, 2,\ x_2 = 0, \\
\mu + \beta_k & \text{if } x_1 = 0,\ x_2 = k,\ k = 1, 2, 3, \\
\mu + \alpha_j + \beta_k & \text{if } x_1 = j,\ j = 1, 2,\ x_2 = k,\ k = 1, 2, 3.
\end{cases}$$

As shorthand, we write this model as

$$E[Y \mid x_1 = j, x_2 = k] = \mu + \alpha_j \times I(x_1 = j) + \beta_k \times I(x_2 = k),$$

for $j = 0, 1, 2$, $k = 0, 1, 2, 3$, with $\alpha_0 = \beta_0 = 0$.


² This terminology is potentially deceptive since “effects” invite a causal interpretation.

Fig. 5.3 Expected values for various models with two binary factors $x_1$ and $x_2$; “x” represents
$x_2 = 0$ and “o” $x_2 = 1$: (a) null model, (b) $x_1$ main effect only, (c) $x_2$ main effect only, (d) $x_1$
and $x_2$ main effects, (e) interaction model 1, (f) interaction model 2. The dashed lines in panels
(e) and (f) denote the expected response under the main effects only model


When one or more of the covariates are factors, interest may focus on interactions.
To illustrate, suppose first we have two binary factors, $x_1$ and $x_2$, each coded
as 0, 1. The most general form for the mean is the saturated model

$$E[Y \mid x_1, x_2] = \mu + \alpha_1 \times I(x_1 = 1) + \beta_1 \times I(x_2 = 1) + \gamma_{11} \times I(x_1 = 1, x_2 = 1), \tag{5.14}$$

where we have four unknown parameters and the responses may be summarized
as four mean values. Figure 5.3 shows a variety of scenarios that may occur with
this model. Panel (a) shows the null model in which the response does not depend
on either variable, and panels (b) and (c) main effects due to $x_1$ only and to $x_2$
only, respectively. In panel (d) the response depends on both factors in a simple
additive main effects only fashion (which is characterized by the parallel lines on
the plot). The association with $x_2$ is the same for both levels of $x_1$ and $\gamma_{11} = 0$
in (5.14). Panels (e) and (f) show two different interaction scenarios. In panel (e),
when $x_1 = 1$ and $x_2 = 1$ simultaneously, the expected response is greater than
that predicted by the main effects only model (which is shown as a dashed line).
In panel (f), the effect of the interaction is to reduce the association due to $x_2$.
For the $x_1 = 0$ population, individuals with $x_2 = 1$ have an increased expected
response over individuals with $x_2 = 0$. In the $x_1 = 1$ population, this association is
reversed. In the saturated model (5.14), $\gamma_{11}$ is measuring the difference between the
average in the $x_1 = 1$, $x_2 = 1$ population and that predicted by the main effects only
model. In the saturated model, $\alpha_1$ is the expected change in the response between
the $x_1 = 1$ and the $x_1 = 0$ populations when $x_2 = 0$, and $\alpha_1 + \gamma_{11}$ is this same
comparison when $x_2 = 1$.

Table 5.1 Corner-point notation for two-factor model with interaction

            $x_2 = 0$          $x_2 = 1$                                $x_2 = 2$                                $x_2 = 3$
$x_1 = 0$   $\mu$              $\mu + \beta_1$                          $\mu + \beta_2$                          $\mu + \beta_3$
$x_1 = 1$   $\mu + \alpha_1$   $\mu + \alpha_1 + \beta_1 + \gamma_{11}$   $\mu + \alpha_1 + \beta_2 + \gamma_{12}$   $\mu + \alpha_1 + \beta_3 + \gamma_{13}$
$x_1 = 2$   $\mu + \alpha_2$   $\mu + \alpha_2 + \beta_1 + \gamma_{21}$   $\mu + \alpha_2 + \beta_2 + \gamma_{22}$   $\mu + \alpha_2 + \beta_3 + \gamma_{23}$
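
The interpretation of the parameters in the saturated model (5.14) can be made concrete with a short sketch. The function name and the four cell means are hypothetical, chosen only for illustration; the mapping from cell means to $(\mu, \alpha_1, \beta_1, \gamma_{11})$ follows directly from (5.14).

```python
def saturated_params(m00, m10, m01, m11):
    """Corner-point parameters of (5.14) from the four cell means.

    m_jk denotes the mean response when x1 = j and x2 = k.
    """
    mu = m00
    alpha1 = m10 - m00
    beta1 = m01 - m00
    gamma11 = m11 - m10 - m01 + m00
    return mu, alpha1, beta1, gamma11


# Illustrative cell means.
mu, alpha1, beta1, gamma11 = saturated_params(2.0, 3.0, 4.0, 7.0)
main_effects_prediction = mu + alpha1 + beta1   # prediction for x1 = 1, x2 = 1
print(mu, alpha1, beta1, gamma11)               # 2.0 1.0 2.0 2.0
print(7.0 - main_effects_prediction)            # 2.0, which equals gamma11
```

Here the comparison of $x_1 = 1$ with $x_1 = 0$ is `alpha1` when $x_2 = 0$ and `alpha1 + gamma11` when $x_2 = 1$, matching the description above.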

In this example we have a two-way (also known as first-order) interaction (a
terminology that extends in an obvious fashion to three or more factors). If an
interaction exists in a model, then all main effects that are involved in the interaction
will often be included in the model, which is known as the hierarchy principle
(see Sect. 4.8 for further discussion). Following this principle aids in interpretation,
but there are situations in which one would not restrict oneself to this subset of
models. For example, in a prediction setting (Chaps. 10-12), we may ignore the
hierarchy principle.

Table  5.1  illustrates  the  corner-point  parameterization  for  the  case  in  which  there 
are  two  factors  with  three  and  four  levels  and  all  two-way  interactions  are  present. 
The main effects model is obtained by setting $\gamma_{jk} = 0$ for $j = 1, 2$, $k = 1, 2, 3$. This
notation  extends  to  generalized  linear  models,  as  we  see  in  Chap.  6. 


5.5.3  Data  Transformations 

Model  (5.1)  assumes  uncorrelated  errors  with  constant  variance.  If  there  is  evidence 
of  nonconstant  variance,  the  response  may  be  transformed  to  achieve  constant 
variance,  though  this  changes  other  characteristics  of  the  model.  Historically,  this 
was  a  popular  approach  due  to  the  lack  of  easily  implemented  alternatives  to 
the  linear  model  with  constant  variance,  and  it  is  still  useful  in  some  instances. 
For example, for positive data, taking the log transform and fitting linear models is
a common strategy. An alternative approach that is often preferable is to retain the
mean-variance relationship and model on the original scale of the response (using
a generalized linear model, for example; see Chap. 6).

Suppose  we  have 

$$E[Y] = \mu_y$$

and

$$\mathrm{var}(Y) = \sigma_y^2\, g(\mu_y),$$

so that the mean-variance relationship is determined by $g(\cdot)$, which is assumed
known, at least approximately. Consider the transformed random variable, $Z = h(Y)$.
Taking the approximation

$$Z \approx h(\mu_y) + (Y - \mu_y)\, h'(\mu_y),$$

where $h'(\mu_y) = \left.\frac{dh}{dy}\right|_{\mu_y}$, produces

$$E[Z] \approx h(\mu_y),$$

and

$$\mathrm{var}(Z) \approx \sigma_y^2\, g(\mu_y)\, h'(\mu_y)^2.$$

To obtain independence between the variance and the mean, we therefore require

$$h(\cdot) = \int g(y)^{-1/2}\, dy. \tag{5.15}$$

For example, a commonly encountered relationship for positive responses is
$\mathrm{var}(Y) = \sigma_y^2 \mu_y^2$, so that the coefficient of variation (which is the standard deviation
divided by the mean) is constant. In this case, the suggested transformation,
from (5.15), is $Z = \log Y$. As a second example, if $\mathrm{var}(Y) = \sigma_y^2 \mu_y$, the
recommended transformation is $Z = \sqrt{Y}$.
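
The first example following (5.15) can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than anything from the text: it uses a gamma distribution (one convenient positive distribution with constant coefficient of variation), and the function name, the coefficient of variation of 0.2, and the sample size are hypothetical choices. It compares the standard deviation of $Y$ with that of $\log Y$ across several means.

```python
import math
import random
import statistics


def sd_raw_and_log(mean, cv=0.2, n=20000, seed=0):
    """Simulate positive responses with a fixed coefficient of variation and
    return the sample standard deviations of Y and of log(Y)."""
    rng = random.Random(seed)
    shape = 1.0 / cv ** 2      # for a gamma distribution, CV = 1 / sqrt(shape)
    scale = mean / shape       # chosen so that E[Y] = shape * scale = mean
    y = [rng.gammavariate(shape, scale) for _ in range(n)]
    return statistics.stdev(y), statistics.stdev([math.log(v) for v in y])


for m in (1.0, 10.0, 100.0):
    sd_y, sd_log_y = sd_raw_and_log(m)
    # sd of Y grows in proportion to the mean; sd of log(Y) stays near cv.
    print(m, round(sd_y, 2), round(sd_log_y, 3))
```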

Transformations  of  Y,  and/or  covariates,  may  also  be  taken  in  order  to  obtain  an 
approximately  linear  association,  though  it  is  advisable  to  do  this  before  seeing  the 
scatterplot of $y$ versus $x$, since data dredging is a bad idea, as discussed in Sect. 4.10.

Parameter  interpretation  is  usually  less  straightforward  if  we  have  transformed 
the  response  and/or  the  covariates,  as  we  illustrate  with  a  series  of  examples.  In  this 
section,  for  clarity,  we  explicitly  state  the  base  of  the  logarithm.  Suppose  we  fit 
the  model 

$$\log_e Y = \beta_0 + \beta_1 x + \epsilon, \tag{5.16}$$

or equivalently

$$Y = \exp(\beta_0 + \beta_1 x + \epsilon) = \exp(\beta_0 + \beta_1 x)\,\delta, \tag{5.17}$$

where $\delta = \exp(\epsilon)$. The expectation of $Y$ depends on the distribution of $\epsilon$, but the
median of $Y \mid x$ is $\exp(\beta_0 + \beta_1 x)$, so long as the median of $\epsilon$ is zero. It will often
be more appropriate to report associations in terms of the median for a positive
response; $\exp(\beta_0)$ is the median response when $x = 0$, and $\exp(\beta_1)$ is the ratio
of median responses corresponding to a unit increase in $x$. We may interpret the
intercept in terms of the expected value for specific distributional choices for $\epsilon$.
For example, if $\epsilon \mid x \sim N(0, \sigma^2)$, then since $Y$ is lognormal (Appendix D),

$$E[Y \mid x] = \exp(\beta_0 + \beta_1 x + \sigma^2/2),$$

giving $E[Y \mid x = 0] = \exp(\beta_0 + \sigma^2/2)$ and

$$\frac{E[Y \mid x + 1]}{E[Y \mid x]} = \exp(\beta_1), \tag{5.18}$$

so that $\exp(\beta_1)$ can be interpreted as the ratio of expected responses between
subpopulations whose $x$ values differ by one unit. The interpretation (5.18) is true
for other distributions, so long as $E[\exp(\epsilon) \mid x]$ does not depend on $x$. In general,
if (5.18) holds, $\exp(c\beta_1)$ is the ratio of expected responses between subpopulations
with covariate values $x + c$ and $x$. An alternative interpretation follows from
observing that

$$\frac{d}{dx} E[Y \mid x] = \beta_1 E[Y \mid x],$$

so that the rate of change of the mean function with respect to $x$ is proportional to
the mean, with proportionality constant $\beta_1$.
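
The median and mean interpretations of model (5.16) can be compared in a small simulation. This is a minimal sketch under normal errors; the function name and the values of $\beta_0$, $\beta_1$, and $\sigma$ are hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def simulate_y(x, beta0=0.5, beta1=0.3, sigma=0.8, n=40000, seed=0):
    """Draw n responses from log Y = beta0 + beta1 * x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [math.exp(beta0 + beta1 * x + rng.gauss(0.0, sigma)) for _ in range(n)]


y_at_0 = simulate_y(0.0, seed=11)
y_at_1 = simulate_y(1.0, seed=12)

median_ratio = statistics.median(y_at_1) / statistics.median(y_at_0)
mean_ratio = statistics.fmean(y_at_1) / statistics.fmean(y_at_0)

# Both ratios should be near exp(beta1); the mean at x = 0 should be near
# exp(beta0 + sigma^2 / 2), which exceeds the median exp(beta0).
print(round(median_ratio, 2), round(mean_ratio, 2), round(math.exp(0.3), 2))
print(round(statistics.median(y_at_0), 2), round(statistics.fmean(y_at_0), 2))
```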

Model  (5.16),  with  the  assumption  of  normal  errors,  is  useful  if  the  standard 
deviation  on  the  original  scale  is  proportional  to  the  mean  (to  give  a  constant 
coefficient  of  variation)  since,  evaluating  the  variance  of  a  lognormal  distribution 
(Appendix  D), 

$$\mathrm{var}(Y \mid x) = E[Y \mid x]^2 \left[\exp(\sigma^2) - 1\right],$$

and if $\sigma^2$ is small, $\exp(\sigma^2) \approx 1 + \sigma^2$, and so

$$\mathrm{var}(Y \mid x) \approx E[Y \mid x]^2 \sigma^2,$$


showing  that  for  this  model  we  have,  approximately,  a  constant  coefficient  of 
variation.  Hence,  log  transformation  of  the  response  is  often  useful  for  strictly 
positive  responses,  which  ties  in  with  the  example  following  (5.15). 

A  model  that  looks  similar  to  (5.17)  is 

$$Y = E[Y \mid x] + \epsilon = \exp(\beta_0 + \beta_1 x) + \epsilon. \tag{5.19}$$

In this model we have additive errors, whereas in the previous case, the errors were
multiplicative. For the additive model, $\exp(\beta_0)$ is the expected value at $x = 0$, and
$\exp(\beta_1)$ is the ratio of expected responses between subpopulations whose $x$ values
differ by one unit, regardless of the error distribution (so long as it has zero mean).
In  model  (5.19),  we  may  question  whether  additive  errors  are  reasonable  given  that 
the  mean  function  is  always  positive,  though  if  the  responses  are  well  away  from 
zero,  there  may  not  be  a  problem.  Model  (5.19)  is  nonlinear  in  the  parameters, 
whereas  (5.16)  is  linear,  which  has  implications  for  inference  and  computation,  as 
discussed  in  Chap.  6. 

We  now  consider  the  model 

$$Y = \beta_0 + \beta_1 \log_{10} x + \epsilon, \tag{5.20}$$

which can be useful if linearity of the mean is reasonable on a log scale. For
example, if we have dose levels of a drug $x$ of 1, 10, 100, and 1,000, then we would
be very surprised if changing $x$ from 1 to 2 produces the same change in the expected
response as increasing $x$ from 1,000 to 1,001. Modeling on the original scale might
also result in extreme $x$ values that are overly influential, though the appropriateness
of the description of the relationship between $Y$ and $x$ should drive the decision as to
which scale to model on. For model (5.20), the obvious mathematical interpretation
is that $\beta_1$ represents the difference in the expected response for individuals whose
$\log_{10} x$ values differ by one unit. A more substantive interpretation follows from
observing that

$$E[Y \mid cx] - E[Y \mid x] = \beta_1 \log_{10} c,$$

so that for a $c = 10$-fold increase in $x$, the expected responses differ by $\beta_1$.
Therefore, taking $\log_{10} x$ gives an associated coefficient that gives the same change
in the average when going from 1 to 10, as when going from 100 to 1,000.

Similarly, if we consider a linear model in $\log_2 x$, then $k\beta_1$ is the additive
difference between the expected response for two subpopulations with covariates
$2^k x$ and $x$. For example, if one subpopulation has twice the covariate of another,
the difference in the expected response is $\beta_1$. In general, if we reparameterize via
$\log_a x$ (to give $\beta_1$ as the change corresponding to an $a$-fold change), then the effect
of a $b$-fold change is $\beta_1 \log_a b$. As an example, if we initially assume the model

$$E[Y \mid x] = \beta_0 + \beta_1 \log_e x,$$

then $\beta_1 \log_e 10 = 2.30 \times \beta_1$ is the expected change for a 10-fold change in $x$.
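
The change-of-base rule above is easy to verify numerically. In the sketch below the function name is hypothetical; it returns $\beta_1 \log_a b$, the additive change in the expected response for a $b$-fold change in $x$ when the model is linear in $\log_a x$.

```python
import math


def change_for_fold(beta1, base, fold):
    """Additive change in E[Y] for a `fold`-fold change in x when the model is
    linear in log_base(x) with slope beta1, i.e. beta1 * log_base(fold)."""
    return beta1 * math.log(fold) / math.log(base)


# Natural-log model, 10-fold change in x: beta1 * log_e(10), about 2.30 * beta1.
print(round(change_for_fold(1.0, math.e, 10.0), 2))   # 2.3

# Base-2 model, 8-fold change (k = 3 doublings): approximately 3 * beta1.
print(round(change_for_fold(0.5, 2.0, 8.0), 6))
```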

We  now  consider  the  model  with  both  Y  and  x  transformed 

$$\log_e Y = \beta_0 + \beta_1 \log_{10} x + \epsilon.$$

Under this specification, $\exp(\beta_1)$ represents the multiplicative change in the median
response corresponding to a 10-fold increase in $x$.

Example: Prostate Cancer

For the prostate data, a simple linear regression model that does not adjust for
additional variables is

$$\log(\text{PSA}) = \beta_0 + \beta_1 \times \log_e(\text{can vol}) + \epsilon,$$

where the errors $\epsilon$ are uncorrelated with $E[\epsilon] = 0$ and $\mathrm{var}(\epsilon) = \sigma^2$. In this model,
$\exp(\beta_1)$ is the multiplicative change in median PSA associated with an $e$-fold
change in cancer volume. Perhaps more usefully, $\exp(2.30 \times \beta_1)$ is the multiplicative
change in median PSA associated with a 10-fold increase in cancer volume.


5.6  Frequentist  Inference 
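
5.6.1 Likelihood

Consider the model

$$Y = x\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

with $\boldsymbol{\epsilon} \sim N_n(0, \sigma^2 I_n)$, $x = [1, x_1, \ldots, x_k]$, and $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_k]^T$. The
complete parameter vector is $\boldsymbol{\theta} = [\boldsymbol{\beta}, \sigma]$ and is of dimension $p \times 1$ where $p = k + 2$.
The likelihood function is

$$L(\boldsymbol{\theta}) = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}(y - x\boldsymbol{\beta})^T(y - x\boldsymbol{\beta})\right],$$

with log likelihood

$$l(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - x\boldsymbol{\beta})^T(y - x\boldsymbol{\beta}),$$

which yields the score equations (estimating functions)

$$S_1(\boldsymbol{\theta}) = \frac{\partial l}{\partial \boldsymbol{\beta}} = \frac{1}{\sigma^2}\, x^T(Y - x\boldsymbol{\beta}), \tag{5.21}$$

$$S_2(\boldsymbol{\theta}) = \frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}(Y - x\boldsymbol{\beta})^T(Y - x\boldsymbol{\beta}). \tag{5.22}$$

Setting (5.21) and (5.22) to zero (and assuming $x^T x$ is of full rank) gives the MLEs

$$\hat{\boldsymbol{\beta}} = (x^T x)^{-1} x^T Y, \qquad \hat{\sigma} = \left[\frac{1}{n}(Y - x\hat{\boldsymbol{\beta}})^T(Y - x\hat{\boldsymbol{\beta}})\right]^{1/2}.$$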
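
We now examine the properties of these estimators, beginning with $\hat{\boldsymbol{\beta}}$:

$$\begin{aligned}
E[\hat{\boldsymbol{\beta}}] &= (x^T x)^{-1} x^T E[Y] \\
&= (x^T x)^{-1} x^T x \boldsymbol{\beta} \\
&= \boldsymbol{\beta},
\end{aligned}$$

so that $\hat{\boldsymbol{\beta}}$ is an unbiased estimator for all $n$. Though $S_2$ is an unbiased estimating
function, $\hat{\sigma}$ is a nonlinear function of $S_2$ and so has finite sample bias (but is
asymptotically unbiased).

Asymptotic variance estimators are obtained from the information matrix:

$$I(\boldsymbol{\theta}) = -E\left[\frac{\partial S}{\partial \boldsymbol{\theta}}\right] = \begin{bmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{bmatrix},$$

where $S = [S_1, S_2]^T$, and

$$I_{11} = -E\left[\frac{\partial S_1}{\partial \boldsymbol{\beta}}\right] = \frac{x^T x}{\sigma^2}, \qquad
I_{12} = -E\left[\frac{\partial S_1}{\partial \sigma}\right] = I_{21}^T = 0, \qquad
I_{22} = -E\left[\frac{\partial S_2}{\partial \sigma}\right] = \frac{2n}{\sigma^2}.$$

Taking $\mathrm{var}(\hat{\boldsymbol{\theta}}) = I(\boldsymbol{\theta})^{-1}$ gives

$$\mathrm{var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (x^T x)^{-1}, \qquad \mathrm{var}(\hat{\sigma}) = \frac{\sigma^2}{2n}.$$

In practice, $\sigma^2$ is replaced by its estimator to give

$$\widehat{\mathrm{var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (x^T x)^{-1}, \qquad \widehat{\mathrm{var}}(\hat{\sigma}) = \frac{\hat{\sigma}^2}{2n}.$$

For $\hat{\boldsymbol{\beta}}$ to be unbiased, we need only assume $E[Y \mid x] = x\boldsymbol{\beta}$, while for $\mathrm{var}(\hat{\boldsymbol{\beta}}) =
\sigma^2 (x^T x)^{-1}$, we require $\mathrm{var}(Y \mid x) = \sigma^2 I_n$, but not normality of errors. The
expression for the variance is also exact for finite $n$.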

The asymptotic distribution of the MLE based on $n$ observations, $\hat{\boldsymbol{\beta}}_n$, is

$$(x^T x)^{1/2}(\hat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}) \to_d N_{k+1}(0, \sigma^2 I_{k+1}), \tag{5.23}$$

and (by Slutsky’s theorem, Appendix G) is still valid if $\sigma$ is replaced by a consistent
estimator. It should be stressed that normality of $Y$ is not required, just $n$ sufficiently
large for the central limit theorem to apply. Since $\hat{\boldsymbol{\beta}}_n$ is a linear combination
of independent observations, the central limit theorem may be directly applied.
Another way of viewing this asymptotic derivation is as replacing the likelihood
$p(y \mid \boldsymbol{\beta})$ by $p(\hat{\boldsymbol{\beta}}_n \mid \boldsymbol{\beta})$.

For $\hat{\sigma}$ to be asymptotically unbiased, we require $\mathrm{var}(Y \mid x) = \sigma^2 I_n$, so that the
estimating function for $\sigma$, (5.22), is unbiased. For $\mathrm{var}(\hat{\sigma}) = \sigma^2/(2n)$ to hold, we need
the third and fourth moments to be correct and equal to zero and $3\sigma^4$, respectively,
as with the normal distribution. The dependence on higher-order moments results in
inference for $\sigma$ being intrinsically more hazardous than inference for $\boldsymbol{\beta}$.

Intervals for $\beta_j$, the $j$th component of $\boldsymbol{\beta}$, are based upon the statistic

$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\mathrm{s.e.}}(\hat{\beta}_j)},$$

where the standard error in the denominator is $\hat{\sigma}$ times the square root of the $(j, j)$th
element of $(x^T x)^{-1}$. The robustness to non-normality of the data is in part due to
the standardization via the estimated standard error. In particular, we only require
$\hat{\sigma} \to_p \sigma$. An asymptotic $100 \times (1 - \alpha)\%$ confidence interval for $\beta_j$ is

$$\hat{\beta}_j \pm z_{\alpha/2} \times \widehat{\mathrm{s.e.}}(\hat{\beta}_j),$$

where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$.
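
For simple linear regression ($k = 1$) the interval above can be computed without matrix software, since the slope element of $(x^T x)^{-1}$ is $1/\sum_i (x_i - \bar{x})^2$. The following is a minimal sketch; the function name and the five data points are hypothetical, and the error standard deviation is estimated with divisor $n - 2$ (an assumption made here; the MLE uses divisor $n$).

```python
import math
from statistics import NormalDist


def simple_ols(x, y, alpha=0.05):
    """Fit y = b0 + b1 * x by least squares. Returns the estimates, the standard
    error of the slope, and an asymptotic normal confidence interval for the slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))    # divisor n - 2; the MLE divides by n
    se_b1 = sigma_hat / math.sqrt(sxx)      # sigma_hat * sqrt of slope element of (x'x)^{-1}
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return b0, b1, se_b1, (b1 - z * se_b1, b1 + z * se_b1)


# Illustrative data only.
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [1.1, 2.9, 5.2, 6.8, 9.0]
b0, b1, se_b1, ci = simple_ols(x_obs, y_obs)
print(round(b0, 3), round(b1, 3), round(se_b1, 4), ci)
```

With only five observations the normal-based interval is purely illustrative; the asymptotic argument in the text presumes $n$ large.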

If we wish to make inference about $\sigma^2$, then we might be tempted to construct a
confidence interval for $\sigma^2$ by leaning on $\epsilon_i \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$. This leads to

$$\frac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-k-1}, \tag{5.24}$$

where $\mathrm{RSS} = \sum_{i=1}^n (Y_i - x_i\hat{\boldsymbol{\beta}})^2$ is the residual sum of squares. Intervals obtained
in this manner are extremely non-robust to departures from normality; however,
see van der Vaart (1998, p. 27). The chi-square statistic does not standardize in any
way, and any attempt to do so would require an estimate of the fourth moment of
the error distribution, an endeavor that will be difficult due to the inherent variability
in an estimate of the kurtosis (for a normal distribution, the kurtosis is zero, and so
we do not require an estimate). Consequently, an interval (or test) based on (5.24)
should not be used in practice unless we have strong evidence to suggest normality
(or close to normality) of errors.

If the errors are such that $\boldsymbol{\epsilon} \mid \sigma^2 \sim N_n(0, \sigma^2 I_n)$, then combining (5.23)
with (5.24) gives, using (E.2) of Appendix E, the distribution

$$\hat{\boldsymbol{\beta}} \sim T_{k+1}\left[\boldsymbol{\beta},\ s^2 (x^T x)^{-1},\ n - k - 1\right], \tag{5.25}$$

a $(k + 1)$-dimensional Student’s t distribution with location $\boldsymbol{\beta}$, scale matrix
$s^2 (x^T x)^{-1}$, and $n - k - 1$ degrees of freedom (Appendix D). A $100 \times (1 - \alpha)\%$ confidence
interval for $\beta_j$ follows as

$$\hat{\beta}_j \pm t^{n-k-1}_{\alpha/2} \times \widehat{\mathrm{s.e.}}(\hat{\beta}_j),$$

where $t^{n-k-1}_{\alpha/2}$ is the $\alpha/2$ percentage point of a standard t random variable with
$n - k - 1$ degrees of freedom. A more reliable approach to the construction of
confidence intervals for elements of $\boldsymbol{\beta}$ is to use the bootstrap or sandwich estimation,
though if $n$ is small, the latter are likely to be unstable. For small $n$, a Bayesian
approach may be taken, though there is no way that the distributional assumption
made for the data (i.e., the likelihood) can be reliably assessed.

We  have  just  discussed  the  non-robustness  of  (5.24)  to  normality.  It  is  perhaps 
surprising  then  that  confidence  intervals  constructed  from  (5.25)  are  used,  since  they 
are  derived  directly  from  (5.24).  However,  the  resultant  intervals  are  conservative  in 
the  sense  that  they  are  wider  than  those  constructed  from  (5.23),  explaining  their 
widespread  use. 

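For a test of $H_0: \beta_j = c$, $j = 1, \ldots, k$, we may derive a t-test. Under $H_0$,

$$T = \frac{\hat{\beta}_j - c}{S_j^{1/2}\hat{\sigma}} \sim T_{n-k-1}, \tag{5.26}$$

where $S_j$ is the $(j, j)$th element of $(x^T x)^{-1}$ and $T_{n-k-1}$ denotes the univariate
t distribution with $n - k - 1$ degrees of freedom (with location 0 and scale 1, given
the standardization in (5.26)). Although $\hat{\sigma}$ can be very unstable, (5.26) is an example of a self-normalized sum
and so is asymptotically normal (Giné et al. 1997). The test with $c = 0$ is equivalent
to the partial F statistic

$$F = \frac{\mathrm{FSS}(\beta_j \mid \beta_0, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_k)/1}{\mathrm{RSS}(\boldsymbol{\beta})/(n - k - 1)},$$

where $\mathrm{RSS}(\boldsymbol{\beta})$ is the residual sum of squares given the regression model $E[Y \mid x] =
x\boldsymbol{\beta}$ and the fitted sum of squares

$$\mathrm{FSS}(\beta_j \mid \beta_0, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_k) = \mathrm{RSS}(\beta_0, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_k) - \mathrm{RSS}(\boldsymbol{\beta}),$$

is equal to the change in residual sum of squares when $\beta_j$ is dropped from the
model. The “partial” here refers to the occurrence of $\beta_l$, $l \neq j$, in the model. Under
$H_0$, $F \sim F_{1, n-k-1}$. The link with (5.26) is that $F = T^2$ with $T$ evaluated at $c = 0$.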
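
The identity $F = T^2$ can be checked numerically in the simplest case, $k = 1$, where dropping the slope leaves the intercept-only model. The sketch below is illustrative: the function name and data are hypothetical, and the error variance is estimated by $\mathrm{RSS}/(n - 2)$ in both statistics.

```python
import math


def slope_t_and_partial_f(x, y):
    """Return the t statistic for H0: slope = 0 and the partial F statistic for
    dropping the slope, in a simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss_full = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    rss_reduced = sum((yi - ybar) ** 2 for yi in y)   # intercept-only model
    s2 = rss_full / (n - 2)
    t = b1 / math.sqrt(s2 / sxx)          # (5.26) with c = 0 and S_j = 1 / sxx
    f = (rss_reduced - rss_full) / s2     # FSS / 1 divided by RSS / (n - k - 1)
    return t, f


# Illustrative data only.
t_stat, f_stat = slope_t_and_partial_f([0.0, 1.0, 2.0, 3.0, 4.0],
                                       [1.1, 2.9, 5.2, 6.8, 9.0])
print(math.isclose(t_stat ** 2, f_stat))   # True
```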

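Let $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \boldsymbol{\beta}_2]$ be a partition with $\boldsymbol{\beta}_1 = [\beta_0, \ldots, \beta_q]$ and $\boldsymbol{\beta}_2 = [\beta_{q+1}, \ldots, \beta_k]$,
with $0 \le q < k$. Interest may focus on simultaneously testing whether a set of
parameters is equal to zero, via a test of the null

$$H_0: \boldsymbol{\beta}_1 \text{ unrestricted},\ \boldsymbol{\beta}_2 = 0 \quad \text{versus} \quad H_1: \boldsymbol{\beta} = [\boldsymbol{\beta}_1, \boldsymbol{\beta}_2] \neq [\boldsymbol{\beta}_1, 0].$$

Under $H_0$, the partial F statistic

$$F = \frac{\mathrm{FSS}(\beta_{q+1}, \ldots, \beta_k \mid \beta_0, \beta_1, \ldots, \beta_q)/(k - q)}{\mathrm{RSS}/(n - k - 1)} = \frac{\mathrm{FSS}(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)/(k - q)}{\mathrm{RSS}/(n - k - 1)},$$

is distributed as $F_{k-q, n-k-1}$ (Appendix D). Note that

$$\mathrm{FSS}(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) \neq \mathrm{FSS}(\boldsymbol{\beta}_2),$$

unless $[x_1, \ldots, x_q]$ is orthogonal to $[x_{q+1}, \ldots, x_k]$. Such derivations are crucial to
the mechanics of analysis of variance models, which we describe in Sect. 5.8.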

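Extending the above with $q = -1$ so that all $k + 1$ parameters are being
considered, the $100 \times (1 - \alpha)\%$ confidence region for $\boldsymbol{\beta}$ is the ellipsoid

$$(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T x^T x (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \le (k + 1)\, s^2\, F_{k+1, n-k-1}(1 - \alpha), \tag{5.28}$$

where $s^2 = \mathrm{RSS}/(n - k - 1)$ and $F_{k+1, n-k-1}(1 - \alpha)$ is the $1 - \alpha$ point of the
F distribution with $k + 1$, $n - k - 1$ degrees of freedom. The total sum of squares
(TSS) may be partitioned as

$$\begin{aligned}
\mathrm{TSS} &= (y - x\boldsymbol{\beta})^T(y - x\boldsymbol{\beta}) \\
&= (y - x\hat{\boldsymbol{\beta}} + x\hat{\boldsymbol{\beta}} - x\boldsymbol{\beta})^T(y - x\hat{\boldsymbol{\beta}} + x\hat{\boldsymbol{\beta}} - x\boldsymbol{\beta}) \\
&= (y - x\hat{\boldsymbol{\beta}})^T(y - x\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T x^T x (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \\
&= \mathrm{RSS} + \mathrm{FSS},
\end{aligned}$$

where the cross term vanishes because $x^T(y - x\hat{\boldsymbol{\beta}}) = 0$ at the least squares solution.
Such expressions are specific to the linear model.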

We now consider prediction of both an expected and an observed response.
The latter requires consideration of what we term measurement error, though we
recognize that the errors in the model in general represent not only discrepancies
arising from the measurement instrument but all manner of additional errors and
sources of model misspecification. For inference concerning the expected response
at covariate vector $x_0$, we define $\theta = x_0\boldsymbol{\beta}$. Then $\hat{\theta} = x_0\hat{\boldsymbol{\beta}}$ and under correct first
and second moment specification and via the central limit theorem:

$$\left[x_0 (x^T x)^{-1} x_0^T\right]^{-1/2}(\hat{\theta}_n - \theta) \to_d N(0, \sigma^2), \tag{5.29}$$

from which confidence intervals may be constructed. For prediction of an observed
response at $x_0$, we define $\phi = x_0\boldsymbol{\beta} + \epsilon$ with estimator $\hat{\phi} = x_0\hat{\boldsymbol{\beta}} + \hat{\epsilon}$. It is now
crucial to make a distributional assumption for the errors. Under $\epsilon \sim N(0, \sigma^2)$,

$$\left[1 + x_0 (x^T x)^{-1} x_0^T\right]^{-1/2}(\hat{\phi} - \phi) \sim N(0, \sigma^2). \tag{5.30}$$

The accuracy of intervals based on this form will be extremely sensitive to the
normality assumption.

5.6.2 Least Squares Estimation

We describe an intuitive method of estimation with a long history and attractive
properties. In ordinary least squares, the estimator is chosen to minimize the residual
sum of squares

$$\mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (Y_i - x_i\boldsymbol{\beta})^2 = (Y - x\boldsymbol{\beta})^T(Y - x\boldsymbol{\beta}).$$

Differentiation (and scaling for convenience) gives

$$-\frac{1}{2}\frac{\partial}{\partial \boldsymbol{\beta}}\mathrm{RSS} = G(\boldsymbol{\beta}) = x^T(Y - x\boldsymbol{\beta}), \tag{5.31}$$

with solution

$$\hat{\boldsymbol{\beta}} = (x^T x)^{-1} x^T Y,$$

so long as $x^T x$ is of full rank. If we assume $E[Y \mid x] = x\boldsymbol{\beta}$, then $E[G(\boldsymbol{\beta})] = 0$
and so (5.31) corresponds to an estimating equation, and we may apply the
nonidentically distributed version of Result 2.1, summarized in (2.13), with

$$A_n = E\left[\frac{\partial G}{\partial \boldsymbol{\beta}}\right] = -x^T x, \qquad B_n = \mathrm{var}(G) = x^T \mathrm{var}(Y)\, x.$$

Consequently, to obtain the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$, we need to specify
$\mathrm{var}(Y)$. Assuming $\mathrm{var}(Y) = \sigma^2 I_n$ gives $B_n = \sigma^2 x^T x$ and

$$(x^T x)^{1/2}(\hat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}) \to_d N_{k+1}(0, \sigma^2 I_{k+1}).$$


More  generally,  sandwich  estimation  may  be  applied,  as  we  discuss  in  Sect.  5.6.4. 

In the method of generalized least squares, we assume $E[Y \mid x] = x\beta$ and
$\text{var}(Y \mid x) = \sigma^2 V$, where $V$ is a known matrix (weighted least squares corresponds
to diagonal $V$), and consider the function

$$\text{RSS}_G(\beta) = (Y - x\beta)^TV^{-1}(Y - x\beta).$$

Minimization of $\text{RSS}_G(\beta)$ yields the estimating function

$$G_G(\beta) = x^TV^{-1}(Y - x\beta),$$

and corresponding estimator

$$\hat\beta_G = (x^TV^{-1}x)^{-1}x^TV^{-1}Y,$$

with asymptotic distribution

$$(x^TV^{-1}x)^{1/2}(\hat\beta_G - \beta) \to_d N_{k+1}(0, \sigma^2 I_{k+1}).$$  (5.32)


This estimator also arises from a likelihood with $\epsilon \sim N_n(0, \sigma^2 V)$, with $V = I_n$
giving the ordinary least squares estimator, as expected. An unbiased estimator of
$\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n-k-1}(Y - x\hat\beta_G)^TV^{-1}(Y - x\hat\beta_G)$$  (5.33)

(see Exercise 5.1) and may be substituted for $\sigma^2$ in (5.32).
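
The ordinary and generalized least squares estimators, together with the unbiased variance estimator (5.33), can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's code: the function names `ols` and `gls`, the simulated design, and the diagonal choice of $V$ are all hypothetical.

```python
import numpy as np

def ols(x, y):
    """Ordinary least squares: solve the normal equations x^T x beta = x^T y."""
    return np.linalg.solve(x.T @ x, x.T @ y)

def gls(x, y, V):
    """Generalized least squares with var(Y | x) = sigma^2 V, V known.

    Returns the estimate, the unbiased estimate of sigma^2 from (5.33),
    and the estimated variance-covariance matrix of the estimate.
    """
    n, p = x.shape                        # p = k + 1 columns, including the intercept
    Vinv_x = np.linalg.solve(V, x)        # V^{-1} x
    xtVx = x.T @ Vinv_x                   # x^T V^{-1} x
    beta = np.linalg.solve(xtVx, Vinv_x.T @ y)
    resid = y - x @ beta
    sigma2 = resid @ np.linalg.solve(V, resid) / (n - p)
    cov = sigma2 * np.linalg.inv(xtVx)
    return beta, sigma2, cov

# Hypothetical simulated data: variance proportional to 1 + x (weighted least squares)
rng = np.random.default_rng(0)
n = 200
x = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
V = np.diag(1.0 + x[:, 1])
y = x @ np.array([1.0, 1.0]) + rng.normal(0, np.sqrt(np.diag(V)))

beta_ols = ols(x, y)
beta_gls, sigma2_hat, cov_gls = gls(x, y, V)
beta_gls_identity, _, _ = gls(x, y, np.eye(n))   # V = I_n recovers OLS
print(beta_ols, beta_gls, sigma2_hat)
```

Linear solves are used in place of explicit inverses of $V$ for numerical stability; with diagonal $V$ the same estimate is obtained by ordinary least squares on responses and covariates divided by the square roots of the diagonal entries.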

Given  a  particular  dataset  with  n  cases,  a  natural  question  is  as  follows:  What 
is  the  practical  significance  of  a  central  limit  theorem  and  the  associated  regularity 
conditions?  In  the  simple  linear  regression  context,  we  require 


$$\frac{\max_{1 \le i \le n}(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \to 0,$$  (5.34)

as $n \to \infty$. Intuitively, the hypothetical way in which the number of data points
goes to infinity must be such that no single $x$ value can dominate. In Sect. 5.10 we
will  present  a  number  of  simulations  showing  the  behavior  of  the  least  squares 
estimator  as  a  function  of  n,  the  distribution  of  the  errors,  and  the  distribution  of 
the  x  values.  Such  simulations  give  one  an  indication  of  when  asymptotic  normality 
“kicks  in.”  The  required  conditions  indicate  the  sorts  of  x  distributions  that  are 
more  or  less  desirable  for  valid  asymptotic  inference.  A  crucial  observation  is  that 
reliable  asymptotic  inference  via  (5.32)  requires  the  mean-variance  relationship  to 
be  correctly  specified.  We  now  present  a  theorem  that  provides  one  justification  for 
the  use  of  the  least  squares  estimator. 


5.6.3  The  Gauss-Markov  Theorem 

Definition. The best linear unbiased estimator (BLUE) of $\beta$:

• Is a linear function of $Y$, so that the estimator can be written $B^TY$, for an
$n \times (k+1)$ matrix $B$

• Is unbiased so that $E[B^TY] = \beta$

• Has the smallest variance among all linear unbiased estimators

We now state and prove a celebrated theorem.

The Gauss-Markov Theorem: Consider the linear model $E[Y] = x\beta$, where $Y$
is $n \times 1$, $x$ is $n \times (k+1)$, and $\beta$ is $(k+1) \times 1$. Suppose further that $\text{cov}(Y) = \sigma^2 I_n$.
Then $\hat\beta = (x^Tx)^{-1}x^TY$ is the best linear unbiased estimator (BLUE) of $\beta$.

Proof. The estimator $\hat\beta = (x^Tx)^{-1}x^TY$ is clearly linear, and we have already
shown it is unbiased. We therefore only need to show that its variance is smallest among
linear unbiased estimators.

Let $\tilde\beta = AY$ be another linear unbiased estimator with $A$ a $(k+1) \times n$ matrix.
Since the estimator is unbiased, $E[\tilde\beta] = AE[Y] = Ax\beta = \beta$ for any $\beta$, which implies
$Ax = I_{k+1}$. Now

$$\text{var}(\tilde\beta) - \text{var}(\hat\beta) = A\sigma^2 I_nA^T - \sigma^2(x^Tx)^{-1} = \sigma^2\left[AA^T - Ax(x^Tx)^{-1}x^TA^T\right].$$

At this point we define $h = x(x^Tx)^{-1}x^T$, which is known as the hat matrix (see
Sect. 5.11.2). The hat matrix is symmetric and idempotent so that $h^T = h$ and
$hh^T = h$. Further, $I_n - h$ inherits these properties. Using these facts, we can write

$$\text{var}(\tilde\beta) - \text{var}(\hat\beta) = \sigma^2A(I_n - h)A^T = \sigma^2A(I_n - h)(I_n - h)^TA^T,$$

and this $(k+1) \times (k+1)$ matrix is positive semidefinite, establishing that $\hat\beta$ has the
smallest variance among linear unbiased estimators. □

This result shows that $\hat\beta$, which is the least squares estimate, the maximum
likelihood estimate with a normal model, and the Bayesian posterior mean with
normal model and improper prior $\pi(\beta, \sigma^2) \propto \sigma^{-2}$ (as we show in Sect. 5.7), is
optimal among linear unbiased estimators. We emphasize that, in the above theorem, only
first and second moment assumptions were used with no distributional assumptions
being required.
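
The key step of the proof can be checked numerically. The sketch below is an illustration under assumptions, not part of the original text: the competing estimator is taken to be a weighted least squares estimator with arbitrary (hypothetical) positive weights, which is linear and unbiased, and the difference in variance matrices is verified to equal $\sigma^2A(I_n - h)A^T$ with nonnegative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3                                   # p = k + 1
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# OLS as a linear estimator: beta_hat = A_ols @ Y
A_ols = np.linalg.solve(x.T @ x, x.T)

# A competing linear unbiased estimator beta_tilde = A_alt @ Y:
# weighted least squares with arbitrary positive weights (hypothetical choice)
w = rng.uniform(0.5, 2.0, n)
A_alt = np.linalg.solve(x.T @ (w[:, None] * x), x.T * w)

# Unbiasedness for every beta requires A x = I_{k+1}
unbiased_ols = np.allclose(A_ols @ x, np.eye(p))
unbiased_alt = np.allclose(A_alt @ x, np.eye(p))

# With cov(Y) = sigma^2 I_n, var(A Y) = sigma^2 A A^T
sigma2 = 1.0
diff = sigma2 * (A_alt @ A_alt.T - A_ols @ A_ols.T)

# The same difference written with the hat matrix h = x (x^T x)^{-1} x^T
h = x @ A_ols
diff_hat = sigma2 * A_alt @ (np.eye(n) - h) @ A_alt.T

# Eigenvalues of the (symmetrized) difference should all be nonnegative
eigenvalues = np.linalg.eigvalsh((diff + diff.T) / 2)
print(eigenvalues)
```

Any other matrix $A$ satisfying $Ax = I_{k+1}$ could be substituted for `A_alt`; the eigenvalues remain nonnegative up to rounding error.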


5.6.4  Sandwich  Estimation 

We have already examined the properties of the ordinary least squares/maximum
likelihood estimator $\hat\beta = (x^Tx)^{-1}x^TY$ and have seen that $\text{var}(\hat\beta) = (x^Tx)^{-1}\sigma^2$, if
$\text{var}(Y \mid x) = \sigma^2 I_n$. Suppose that the correct variance model is $\text{var}(Y \mid x) = \sigma^2 V$
so that the model from which the estimator was derived was incorrect. Then the
estimator is still unbiased, but the appropriate variance is

$$\text{var}(\hat\beta) = (x^Tx)^{-1}x^T\text{var}(Y \mid x)x(x^Tx)^{-1} = (x^Tx)^{-1}x^TVx(x^Tx)^{-1}\sigma^2.$$  (5.35)

Expression (5.35) can also be derived directly from the estimating function

$$G(\beta) = x^T(Y - x\beta),$$

since we know

$$(A_n^{-1}B_nA_n^{-1})^{-1/2}(\hat\beta_n - \beta) \to_d N_{k+1}(0, I_{k+1}),$$

where $A_n = -x^Tx$ as before and

$$B_n = \text{var}(G) = x^TVx\sigma^2,$$

to give

$$\text{var}(\hat\beta) = (x^Tx)^{-1}x^TVx(x^Tx)^{-1}\sigma^2.$$

We  now  describe  a  sandwich  estimator  of  the  variance  that  relaxes  the  constant 
variance  assumption  but  assumes  uncorrelated  responses.  When  the  variance  is 
not  constant,  the  ordinary  least  squares  estimator  is  consistent  (since  the  mean 
specification  is  correct),  but  the  usual  standard  errors  will  be  inappropriate. 

Consider the estimating function $G(\beta) = x^T(Y - x\beta)$. The “bread” of the
sandwich $A^{-1}$ remains unchanged since $A$ does not depend on $Y$. The “filling”
becomes

$$B = \text{var}(G) = x^T\text{var}(Y)x = \sum_{i=1}^n \sigma_i^2x_i^Tx_i,$$  (5.36)

where $\sigma_i^2 = \text{var}(Y_i)$ and we have assumed that the data are uncorrelated. Unfortunately,
$\sigma_i^2$ is unknown, but various simple estimation techniques are available. An
obvious estimator stems from setting $\hat\sigma_i^2 = (Y_i - x_i\hat\beta)^2$ to give

$$\hat{B}_n = \sum_{i=1}^n x_i^Tx_i(Y_i - x_i\hat\beta)^2,$$  (5.37)

and its use provides a consistent estimator of (5.36). However, this variance
estimator has finite sample downward bias.

For linear regression, the MLE

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - x_i\hat\beta)^2 = \frac{1}{n}\sum_{i=1}^n \hat\sigma_i^2$$

is downwardly biased (as we saw in Sect. 5.6.1), with bias $-(k+1)\sigma^2/n$, which
suggests using

$$\hat{B}_n = \frac{n}{n-k-1}\sum_{i=1}^n x_i^Tx_i(Y_i - x_i\hat\beta)^2.$$  (5.38)

This simple correction provides an estimator of the variance that still has finite sample bias, since
the bias in $\hat\sigma_i^2$ changes as a function of the design points $x_i$, but will often improve
on (5.37). In linear regression, if $\text{var}(Y_i) = \sigma^2$, then $E[(Y_i - x_i\hat\beta)^2] = \sigma^2(1 - h_{ii})$,
where $h_{ii}$ is the $i$th diagonal element of the hat matrix $x(x^Tx)^{-1}x^T$ (we derive this
result in Sect. 5.11.2). Therefore, another suggested correction is

$$\hat{B}_n = \sum_{i=1}^n x_i^Tx_i\frac{(Y_i - x_i\hat\beta)^2}{1 - h_{ii}}.$$  (5.39)

For each of (5.37), (5.38), and (5.39), the variance of the estimator $\hat\beta$ is consistently
estimated by $A_n^{-1}\hat{B}_nA_n^{-1}$.
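
The three fillings can be computed side by side. The following is a minimal sketch, not the author's implementation: the function name `sandwich_variances` and the simulated data are hypothetical, and the hat-matrix version is taken to divide each squared residual by $1 - h_{ii}$, as the expectation result above suggests.

```python
import numpy as np

def sandwich_variances(x, y):
    """Model-based and sandwich variance estimates for the OLS estimator."""
    n, p = x.shape                                   # p = k + 1
    xtx_inv = np.linalg.inv(x.T @ x)
    beta_hat = xtx_inv @ x.T @ y
    resid2 = (y - x @ beta_hat) ** 2
    h = np.einsum("ij,jk,ik->i", x, xtx_inv, x)      # diagonal of the hat matrix

    def sandwich(weights):
        B = x.T @ (weights[:, None] * x)             # sum_i weights_i x_i^T x_i
        return xtx_inv @ B @ xtx_inv                 # the signs of A^{-1} cancel

    return {
        "beta_hat": beta_hat,
        "model_based": xtx_inv * resid2.sum() / (n - p),
        "sandwich_1": sandwich(resid2),                   # unadjusted residuals, (5.37)
        "sandwich_2": sandwich(resid2 * n / (n - p)),     # degrees-of-freedom scaling, (5.38)
        "sandwich_3": sandwich(resid2 / (1.0 - h)),       # hat-matrix correction, (5.39)
    }

# Hypothetical data: exponential x and standard deviation proportional to the mean
rng = np.random.default_rng(2)
n = 100
x = np.column_stack([np.ones(n), rng.exponential(1.0, n)])
mean = x @ np.array([1.0, 1.0])
y = mean + rng.normal(0, mean)

out = sandwich_variances(x, y)
se = {name: np.sqrt(np.diag(v)) for name, v in out.items() if name != "beta_hat"}
print(se)
```

Because $0 < h_{ii} < 1$ here, the hat-matrix version can only inflate the standard errors relative to the unadjusted version, which is the direction needed to offset the downward bias.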

We report the results of a small simulation study, in which we examine the
performance of the sandwich estimator as a function of $n$, the distribution of $x$, and
the variance estimator. We carry out six sets of simulations with the $x$ distribution
either uniform on (0,1) or exponential with rate parameter 1, and
$\text{var}(Y \mid x) = E[Y \mid x]^q \times \sigma^2$ with $q = 0, 1, 2$, so that the variance of the errors is constant,
increases in proportion to the mean, or increases in proportion to the square of the
mean. The errors are normally distributed and uncorrelated in all cases (Sect. 5.10
considers the impact of other forms of model misspecification).

In Table 5.2, we see that, as expected, confidence intervals obtained directly from
the usual variance of the ordinary least squares estimator, that is, $(x^Tx)^{-1}\sigma^2$, give
accurate coverage when the error variance is constant. When the $x$ distribution is
uniform, the model-based coverage is accurate even under variance model misspecification. There
is poor coverage for the exponential distribution, however, which worsens with
increasing $n$. The sandwich estimator confidence intervals require
large samples to attain accurate coverage for the exponential $x$ model. There is a
clear efficiency loss when using sandwich estimation, if the variance of the errors
is constant. The downward bias of the sandwich estimator based on the unadjusted
residuals is apparent, though this bias decreases with increasing $n$. Working with
residuals standardized by $n/(n - k - 1)$, (5.38), improves the coverage, while the
use of the hat matrix version, (5.39), improves performance further.

If  the  errors  are  correlated,  the  sandwich  estimators  of  the  variance  considered 
here  will  not  be  consistent.  Chapter  8  provides  a  description  of  sandwich  estimators 
for  the  correlated  data  situation  that  may  be  used  when  there  is  replication  across 
“clusters.” 
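
A simulation of this kind can be sketched in a few lines. The version below is an illustration under assumptions and will not reproduce Table 5.2 exactly: the function name `coverage` is hypothetical, coverage is recorded for the slope only, $\sigma = 1$ is assumed, a normal quantile of 1.96 is used, and far fewer replicates are run than the 10,000 used in the text.

```python
import numpy as np

def coverage(n, q, n_sims, rng, sigma=1.0, z=1.96):
    """Coverage of nominal 95% intervals for the slope with exponential x.

    Compares the model-based variance with the unadjusted sandwich (5.37)
    when var(Y | x) = E[Y | x]^q * sigma^2 and beta_0 = beta_1 = 1.
    """
    beta = np.array([1.0, 1.0])
    hits_model = 0
    hits_sandwich = 0
    for _ in range(n_sims):
        x = np.column_stack([np.ones(n), rng.exponential(1.0, n)])
        mean = x @ beta                                   # always positive here
        y = mean + rng.normal(0, sigma * mean ** (q / 2))
        xtx_inv = np.linalg.inv(x.T @ x)
        b = xtx_inv @ x.T @ y
        r2 = (y - x @ b) ** 2
        se_model = np.sqrt(xtx_inv[1, 1] * r2.sum() / (n - 2))
        B = x.T @ (r2[:, None] * x)
        se_sandwich = np.sqrt((xtx_inv @ B @ xtx_inv)[1, 1])
        hits_model += int(abs(b[1] - beta[1]) < z * se_model)
        hits_sandwich += int(abs(b[1] - beta[1]) < z * se_sandwich)
    return hits_model / n_sims, hits_sandwich / n_sims

rng = np.random.default_rng(3)
cov_model, cov_sandwich = coverage(n=50, q=2, n_sims=2000, rng=rng)
print(cov_model, cov_sandwich)
```

Under this heteroscedastic setting, the model-based intervals should undercover more severely than the sandwich intervals, in line with the qualitative pattern described above.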


Example:  Prostate  Cancer 

We  fit  the  model 


$$\log y_i = \beta_0 + \beta_1\log_{10}(x_i) + \epsilon_i$$  (5.40)

where $y_i$ is PSA and $x_i$ is the cancer volume for individual $i$, and the $\epsilon_i$ are assumed
uncorrelated with constant variance $\sigma^2$.

Table 5.2 Confidence interval coverage (%) of nominal 95% intervals under a model-based
variance estimator in which the variance is assumed independent of the mean and under
three sandwich estimators given by (5.37)-(5.39)

    n    Model-based   Sandwich 1   Sandwich 2   Sandwich 3
    5        95            84           90           93
   10        95            88           91           92
   25        94            92           93           94
   50        95            94           94           94
  100        95            95           95           95
  250        95            95           95           95
  $\text{var}(y \mid x) = \sigma^2$, $x$ uniform

    5        95            82           88           92
   10        95            85           88           91
   25        95            89           91           92
   50        95            91           92           93
  100        95            93           93           94
  250        95            94           94           94
  $\text{var}(y \mid x) = \sigma^2$, $x$ exponential

    5        95            83           89           92
   10        95            89           92           93
   25        95            92           94           94
   50        95            94           95           95
  100        95            95           95           95
  250        95            95           95           95
  $\text{var}(y \mid x) = E[y \mid x] \times \sigma^2$, $x$ uniform

    5        92            76           83           89
   10        90            77           82           87
   25        87            83           85           88
   50        85            87           88           90
  100        85            90           91           92
  250        83            93           93           93
  $\text{var}(y \mid x) = E[y \mid x] \times \sigma^2$, $x$ exponential

    5        95            83           89           92
   10        95            89           92           93
   25        95            92           93           94
   50        94            94           94           94
  100        95            94           95           95
  250        95            95           95           95
  $\text{var}(y \mid x) = E[y \mid x]^2 \times \sigma^2$, $x$ uniform

    5        89            70           78           86
   10        81            71           75           82
   25        75            78           80           85
   50        73            85           86           88
  100        71            89           90           91
  250        68            92           92           93
  $\text{var}(y \mid x) = E[y \mid x]^2 \times \sigma^2$, $x$ exponential

The true values are $\beta_0 = 1$, $\beta_1 = 1$, and all results are based on 10,000 simulations. In
all cases, the errors are normally distributed and uncorrelated. The true variance model
and distribution of $x$ are given in the last line of each block




Table 5.3 Least squares/maximum likelihood parameter
estimates and model-based and sandwich estimates of the
standard errors, for the prostate cancer data

Parameter    Estimate    Model-based standard error    Sandwich standard error
$\beta_0$    1.51        0.122                         0.123
$\beta_1$    0.719       0.0682                        0.0728

Fig. 5.4 Log of prostate-specific antigen versus log of cancer volume, along with the least
squares/maximum likelihood fit, and 95% pointwise confidence intervals for the
expected linear association (narrow bands) and for a new observation (wide bands).
Horizontal axis: log(cancer volume)


Table 5.3 gives summaries of the linear
association under model-based and sandwich variance estimates. The point estimates
and model-based standard error estimates arise from either ML estimation
(assuming normality of errors) or ordinary least squares estimation of $\beta$. The sandwich
estimates of the standard errors relax the constancy of variance assumption
but assume uncorrelated errors. The standard error of the intercept is essentially
unchanged under sandwich estimation, when compared to the model-based version,
while that for the slope is slightly increased. The sample size of $n = 97$ is large
enough that asymptotic normality of the estimator should be a good approximation. For a 10-fold increase
in cancer volume (in cc), there is an estimated $\exp(\hat\beta_1) = 2.1$-fold increase in PSA concentration.

Figure 5.4 plots the log of PSA versus the log of cancer volume and superimposes
the estimated linear association, along with pointwise 95% confidence intervals
for the expected linear association and for a new observation (assuming normally
distributed data). There does not appear to be any deviation in the random scatter of
the data around the line (a residual plot would give a clearer way of assessing the
constant variance assumption, as we will see in Sect. 5.10). In Fig. 5.5(a), we
plot PSA versus log cancer volume and clearly see the variance of PSA increasing
with increasing cancer volume on this scale. Figure 5.5(b) plots PSA versus cancer
volume. It is very difficult to assess the goodness of fit of the fitted relationship
or assumptions concerning the mean-variance relationship when the response and
covariate are on their original scales. In both plots, the fitted line is from the fitting
of model (5.40).

Fig. 5.5 (a) Prostate-specific antigen versus log cancer volume, (b) Prostate-specific antigen
versus cancer volume. In each case, the least squares/maximum likelihood fit is included


5.7  Bayesian  Inference 


We now consider Bayesian inference for the linear model. As with likelihood
inference, we are required to specify the probability of the data and we assume
$Y \mid \beta, \sigma^2 \sim N_n(x\beta, \sigma^2 I_n)$. The posterior distribution is

$$p(\beta, \sigma^2 \mid y) \propto L(\beta, \sigma^2) \times \pi(\beta, \sigma^2).$$  (5.41)

Closed-form posterior distributions for $\beta$ and $\sigma^2$ are only available under restricted
prior distributions. In particular, consider the improper prior distribution

$$\pi(\beta, \sigma^2) = p(\beta) \times p(\sigma^2) \propto \sigma^{-2}.$$  (5.42)

Under this prior and likelihood combination, the posterior is, up to proportionality,

$$p(\beta, \sigma^2 \mid y) \propto (\sigma^2)^{-(n+2)/2}\exp\left[-\frac{1}{2\sigma^2}(y - x\beta)^T(y - x\beta)\right].$$  (5.43)

To derive $p(\beta \mid y)$, we need to integrate $\sigma^2$ from the joint distribution. To achieve
this, it is useful to use an equality derived earlier, (2.23):

$$(y - x\beta)^T(y - x\beta) = s^2(n - k - 1) + (\hat\beta - \beta)^Tx^Tx(\hat\beta - \beta),$$
where $\hat\beta$ is the ML/LS estimate. Substitution into (5.43) gives

$$p(\beta \mid y) \propto \int (\sigma^2)^{-(n+2)/2}\exp\left\{-\frac{1}{2\sigma^2}\left[s^2(n-k-1) + (\hat\beta - \beta)^Tx^Tx(\hat\beta - \beta)\right]\right\}d\sigma^2.$$

The integrand here is the kernel of an inverse gamma distribution (Appendix D) for
$\sigma^2$ and so has a known normalizing constant, the substitution of which gives

$$p(\beta \mid y) \propto \Gamma\left(\frac{n}{2}\right)\left[\frac{s^2(n-k-1) + (\hat\beta - \beta)^Tx^Tx(\hat\beta - \beta)}{2}\right]^{-n/2} \propto \left[1 + \frac{(\beta - \hat\beta)^Ts^{-2}x^Tx(\beta - \hat\beta)}{n-k-1}\right]^{-(n-k-1+k+1)/2}$$

after some simplification. By inspection we recognize that this expression is the
kernel of a $(k+1)$-dimensional t distribution (Appendix D) with location $\hat\beta$, scale
matrix $s^2(x^Tx)^{-1}$, and $n - k - 1$ degrees of freedom, that is,

$$\beta \mid y \sim T_{k+1}\left[(x^Tx)^{-1}x^Ty, (x^Tx)^{-1}s^2, n-k-1\right].$$  (5.44)

Consequently, under the prior (5.42), the Bayesian posterior mean $E[\beta \mid y]$
corresponds to the MLE, and $100(1 - \alpha)\%$ credible intervals are identical to
$100(1 - \alpha)\%$ confidence intervals, though of course the two intervals have very
different interpretations.

Asymptotically, as with likelihood estimation, it is the covariance model
$\text{var}(Y \mid x)$ that is most important for valid inference, and normality of the error
terms is unimportant. One way of thinking about this is as replacing $y \mid \beta, \sigma^2$ by

$$\hat\beta \mid \beta, \sigma^2 \sim N_{k+1}\left(\beta, \sigma^2(x^Tx)^{-1}\right).$$

We may derive the marginal posterior distribution of $\sigma^2$ as

$$\sigma^2 \mid y \sim (n-k-1)s^2 \times \chi^{-2}_{n-k-1},$$  (5.45)

a scaled inverse chi-squared distribution. As in the frequentist development,
inference for $\sigma^2$ is likely to be highly sensitive to the normality assumption.

Although we can obtain analytic forms for $p(\beta \mid y)$ and $p(\sigma^2 \mid y)$ under the
prior (5.42), closed forms will not be available for general functions of interest.
Direct sampling from the posterior may, however, be utilized for inference in this case.
A sample from the joint distribution $p(\beta, \sigma^2 \mid y)$ can be generated using
the composition method (Sect. 3.8.4) via the factorization

$$p(\beta, \sigma^2 \mid y) = p(\sigma^2 \mid y) \times p(\beta \mid \sigma^2, y),$$

where $\beta \mid \sigma^2, y \sim N_{k+1}(\hat\beta, \sigma^2(x^Tx)^{-1})$, and $\sigma^2 \mid y$ is given by (5.45). Independent
samples are generated via the pair of distributions

$$\sigma^{2(t)} \sim p(\sigma^2 \mid y)$$
$$\beta^{(t)} \sim p(\beta \mid \sigma^{2(t)}, y),$$

for $t = 1, \ldots, T$. Samples for functions of interest $\theta = g(\beta, \sigma^2)$ are then available
as $\theta^{(t)} = g(\beta^{(t)}, \sigma^{2(t)})$.
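
The composition method can be sketched as follows. This is a minimal illustration under the improper prior, not the author's code: the function name `sample_posterior`, the simulated data, and the example function of interest are hypothetical.

```python
import numpy as np

def sample_posterior(x, y, n_samples, rng):
    """Independent draws from p(beta, sigma^2 | y) under the prior pi ∝ sigma^{-2}."""
    n, p = x.shape                                    # p = k + 1
    xtx_inv = np.linalg.inv(x.T @ x)
    beta_hat = xtx_inv @ x.T @ y
    df = n - p
    s2 = np.sum((y - x @ beta_hat) ** 2) / df
    # Step 1: sigma^2 | y is (n - k - 1) s^2 times an inverse chi-squared variate
    sigma2 = df * s2 / rng.chisquare(df, size=n_samples)
    # Step 2: beta | sigma^2, y ~ N(beta_hat, sigma^2 (x^T x)^{-1})
    chol = np.linalg.cholesky(xtx_inv)
    z = rng.standard_normal((n_samples, p))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ chol.T)
    return beta, sigma2, beta_hat

# Hypothetical simulated data
rng = np.random.default_rng(4)
n = 50
x = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = x @ np.array([1.0, 1.0]) + rng.normal(0, 1, n)

beta_draws, sigma2_draws, beta_hat = sample_posterior(x, y, 20000, rng)

# A function of interest with no closed-form posterior: the standardized slope
theta_draws = beta_draws[:, 1] / np.sqrt(sigma2_draws)
print(beta_draws.mean(axis=0), np.quantile(theta_draws, [0.025, 0.5, 0.975]))
```

Since the draws are independent, no burn-in or convergence assessment is needed, in contrast to the Markov chain Monte Carlo approaches required under other priors.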

The conjugate prior (Sect. 3.7.1) here takes the form $\pi(\beta, \sigma^2) = \pi(\beta \mid \sigma^2)\pi(\sigma^2)$
with $\beta \mid \sigma^2 \sim N_{k+1}(m, \sigma^2V)$ and $\sigma^{-2} \sim \text{Ga}(a, b)$. However, this specification
is not that useful in practice since the prior for $\beta$ depends on $\sigma^2$. In particular, for
smaller and smaller $\sigma^2$, the prior for $\beta$ becomes increasingly concentrated about $m$,
which would not seem realistic in many contexts.

Under other prior distributions, analytic/numerical approximations or sampling-based
techniques are required. An obvious prior choice is

$$\beta \sim N_{k+1}(m, V), \qquad \sigma^{-2} \sim \text{Ga}(a, b),$$

which gives the posterior

$$p(\beta, \sigma^2 \mid y) \propto L(\beta, \sigma^2)\pi(\beta)\pi(\sigma^2),$$

which is intractable, unless $V^{-1}$ is the $(k+1) \times (k+1)$ matrix of zeroes, which
is the improper prior case, (5.42), already considered. Although the posterior is not
available in closed form under this prior, it is straightforward to construct a blocked
Gibbs sampling algorithm (Sect. 3.8.4). Specifically, letting $L(\beta, \sigma^2)$ denote the
likelihood, one iterates between the pair of conditional distributions:

$$p(\beta \mid y, \sigma^2) \propto L(\beta, \sigma^2)\pi(\beta) \sim N_{k+1}(m^\star, V^\star)$$  (5.46)

$$p(\sigma^{-2} \mid y, \beta) \propto L(\beta, \sigma^2)\pi(\sigma^{-2}),$$  (5.47)

where

$$m^\star = W \times \hat\beta + (I_{k+1} - W) \times m$$
$$V^\star = W \times \text{var}(\hat\beta)$$

and

$$W = (x^Tx + V^{-1}\sigma^2)^{-1}(x^Tx).$$

Conditional conjugacy is exploited in this derivation; for details, see Exercise 5.4.
For general prior distributions, the Gibbs sampler is less convenient because the
conditional distributions will be of unrecognizable form, but Metropolis-Hastings
steps (Sect. 3.8.2) for $\beta \mid y, \sigma^2$ and $\sigma^{-2} \mid y, \beta$ are straightforward to construct.
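
A blocked Gibbs sampler for the normal and gamma prior pair can be sketched as below. This is an illustration under stated assumptions rather than the author's implementation: the function name `gibbs_linear_model`, the hyperparameter values, and the simulated data are hypothetical, and the conditional for $\sigma^{-2}$ is taken to be the standard conditionally conjugate form $\text{Ga}(a + n/2,\ b + (y - x\beta)^T(y - x\beta)/2)$ with $b$ a rate parameter.

```python
import numpy as np

def gibbs_linear_model(x, y, m, V, a, b, n_iter, rng):
    """Blocked Gibbs sampler alternating between beta | y, sigma^2 and sigma^{-2} | y, beta."""
    n, p = x.shape                                   # p = k + 1
    xtx = x.T @ x
    xtx_inv = np.linalg.inv(xtx)
    beta_hat = xtx_inv @ x.T @ y
    V_inv = np.linalg.inv(V)
    beta = beta_hat.copy()                           # start at the least squares estimate
    beta_draws = np.empty((n_iter, p))
    sigma2_draws = np.empty(n_iter)
    for t in range(n_iter):
        # sigma^{-2} | y, beta ~ Ga(a + n/2, rate = b + RSS(beta)/2)  (assumed form)
        rss = np.sum((y - x @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma(shape=a + n / 2, scale=1.0 / (b + rss / 2))
        # beta | y, sigma^2 ~ N(m*, V*) with W = (x^T x + V^{-1} sigma^2)^{-1} x^T x
        W = np.linalg.solve(xtx + sigma2 * V_inv, xtx)
        m_star = W @ beta_hat + (np.eye(p) - W) @ m
        V_star = W @ (sigma2 * xtx_inv)              # W times var(beta_hat)
        V_star = (V_star + V_star.T) / 2             # remove rounding asymmetry
        beta = rng.multivariate_normal(m_star, V_star)
        beta_draws[t] = beta
        sigma2_draws[t] = sigma2
    return beta_draws, sigma2_draws

# Hypothetical simulated data and a vague prior
rng = np.random.default_rng(5)
n = 100
x = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = x @ np.array([1.0, 1.0]) + rng.normal(0, 1, n)

beta_draws, sigma2_draws = gibbs_linear_model(
    x, y, m=np.zeros(2), V=100.0 * np.eye(2), a=0.01, b=0.01, n_iter=3000, rng=rng
)
burn_in = 500
posterior_mean = beta_draws[burn_in:].mean(axis=0)
least_squares = np.linalg.solve(x.T @ x, x.T @ y)
print(posterior_mean, least_squares)
```

With a prior this diffuse, $W$ is close to the identity and the posterior mean should lie close to the least squares estimate; a more informative $V$ pulls $m^\star$ towards $m$.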


5.8  Analysis  of  Variance 

The  analysis  of  variance,  or  ANOVA,  is  a  method  by  which  the  variability  in  the 
response  is  partitioned  into  components  due  to  the  various  classifying  variables  and 
due  to  error.  At  one  level,  the  ANOVA  model  is  just  a  special  case  of  a  multiple 
linear  regression  model,  but  ANOVA  does  not  simply  have  a  role  as  an  “outgrowth” 
of  linear  models.  Rather  Cox  and  Reid  (2000,  p.  245)  state  that  ANOVA  has  a  role 
“in  clarifying  the  structure  of  sets  of  data,  especially  relatively  complicated  mixtures 
of  crossed  and  nested  data.  This  indicates  what  contrasts  can  be  estimated  and  the 
relevant  basis  for  estimating  error.  From  this  viewpoint  the  analysis  of  variance 
table  comes  first,  then  the  linear  model,  not  vice-versa.”  A  study  of  the  analysis  of 
variance  is  intrinsically  linked  to  the  study  of  the  design  of  experiments.  Numerous 
books  exist  on  ANOVA  and  the  design  of  experiments;  here  we  only  give  a  brief 
discussion  and  introduce  the  main  concepts.  Specifically,  we  distinguish  between 
crossed  and  nested  (or  hierarchical)  designs  and  fixed  and  random  effects  modeling. 


5.8.1  One-Way  ANOVA 

Consider  the  data  in  Table  5.4,  taken  from  Davies  (1967),  which  consist  of  the  yield 
(in  grams)  from  six  randomly  chosen  batches  of  raw  material,  with  five  replicates 
each.  The  aim  of  this  experiment  was  to  find  out  to  what  extent  batch-to-batch 
variation  is  responsible  for  variation  in  the  final  product  yield. 

Data  such  as  these  correspond  to  the  simplest  situation  in  which  we  have  a  single 
factor and a one-way classification. We may model the yield $Y_{ij}$ in the $j$th sample
from batch $i$ as

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij},$$  (5.48)


Table 5.4 Yield of dyestuff in grams of standard color, in each of six batches

Replicate                          Batch
observation       1       2       3       4       5       6
1               1,545   1,540   1,595   1,445   1,595   1,520
2               1,440   1,555   1,550   1,440   1,630   1,455
3               1,440   1,490   1,605   1,595   1,515   1,450
4               1,520   1,560   1,510   1,465   1,635   1,480
5               1,580   1,495   1,560   1,545   1,625   1,445

with $\epsilon_{ij} \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$, $i = 1, \ldots, a$, $j = 1, \ldots, n$. We need a constraint
to prevent aliasing (Sect. 5.5.2), with two possibilities being the sum-to-zero
constraint, $\sum_{i=1}^a \alpha_i = 0$, and the corner-point constraint, $\alpha_1 = 0$. Model (5.48) is
an example of a multiple linear regression with mean

$$E[Y \mid x] = x\beta$$

in which $Y = [Y_{11}, \ldots, Y_{1n}, Y_{21}, \ldots, Y_{2n}, \ldots, Y_{a1}, \ldots, Y_{an}]^T$, the first column of $x$ is a
vector of ones, the remaining $a - 1$ columns of $x$ are indicators of membership of
batches $2, \ldots, a$, and $\beta = [\mu, \alpha_2, \ldots, \alpha_a]^T$,
and where we adopt the corner-point constraint. Suppose we are interested in
whether there are differences between the yields from different batches. No
differences correspond to the null hypothesis:

$$H_0: \alpha_1 = \ldots = \alpha_a = 0.$$  (5.49)

Carrying out $a(a-1)/2$ $t$-tests leads to multiple testing problems (Sect. 4.5).
Viewing this problem from a frequentist perspective and with $a = 6$ batches, we
have 15 tests of pairs of batches, and with an individual type I error of 0.05, this gives
an overall type I error of $1 - 0.95^{15} = 0.54$ if the tests are treated as independent. As an alternative, we may test (5.49)
using an $F$ test (Sect. 5.6.1). Specifically, the $F$ statistic is given by

$$F = \frac{\text{FSS}(\alpha \mid \mu)/(a-1)}{\text{RSS}(\mu, \alpha)/a(n-1)},$$  (5.50)

where

$$\text{FSS}(\alpha \mid \mu) = \text{RSS}(\mu) - \text{RSS}(\mu, \alpha)$$

is the fitted sum of squares that results when $\alpha = [\alpha_1, \ldots, \alpha_a]$ is added to the
model containing $\mu$ only. In (5.50), the $F$ statistic is the ratio of two so-called mean
squares, which are average sums of squares, and under $H_0$, since the contributions in
numerator and denominator are independent, $F \sim F_{a-1, a(n-1)}$. The ANOVA table
associated with the test is given in Table 5.5. This table lays out the quantities that
require calculation and shows the decomposition of the total sum of squares into

Table 5.5 ANOVA table for the one-way classification. The F statistic is for a test of
$H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_a = 0$; DF is short for degrees of freedom and EMS for the expected mean square,
which is E[SS/DF]

Source            Sum of squares                                                                     DF          EMS                                              F statistic
Between batches   $SS_1 = n\sum_{i=1}^a(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$                   $a-1$       $\sigma^2 + n\sum_{i=1}^a\alpha_i^2/(a-1)$       $\frac{SS_1/(a-1)}{SS_2/a(n-1)}$
Error             $SS_2 = \sum_{i=1}^a\sum_{j=1}^n(Y_{ij} - \bar{Y}_{i\cdot})^2$                      $a(n-1)$    $\sigma^2$
Total             $SS_T = \sum_{i=1}^a\sum_{j=1}^n(Y_{ij} - \bar{Y}_{\cdot\cdot})^2$                  $an-1$

Table 5.6 One-way ANOVA table for the dyestuff data; DF is shorthand
for degrees of freedom

Source             Sum of squares    DF    Mean square    F statistic
Between batches         56,358        5       11,272      4.60 (0.0044)
Error                   58,830       24        2,451
Total                  115,188       29

The quantity in brackets in the final column is the p-value


that due to groups (batches in this example) and that due to error. The intuition
behind the $F$ test is that if there are no group effects, then the average sum of
squares corresponding to the groups will, in expectation, equal the error variance.
Consequently, we see in Table 5.5 that the expected mean square is simply $\sigma^2$ when
$\alpha_1 = \ldots = \alpha_a = 0$. The success of the $F$ test depends on the fact that we
may decompose the overall sum of squares into the sum of the constituent parts
corresponding to different components, and these are proportional to independent $\chi^2$ random
variables.

Table 5.6 gives the numerical values for the dyestuff data of Table 5.4 and results
in a very small p-value. As discussed in Sect. 4.2, the calibration of p-values is
difficult, but for this relatively small sample size, a p-value of 0.0044 strongly
suggests that the null is very unlikely to be true, and we would conclude that there
are significant differences between batch means for these data. A Bayesian approach
to testing may be based on Bayes factors. In this linear modeling context, there are
close links between the Bayes factor and the $F$ statistic (O’Hagan 1994, Sect. 9.34),
though as usual the interpretations of the two quantities differ considerably. It is
straightforward to extend the $F$ test to the case of different sample sizes within
batches, that is, to the case of general $n_i$, $i = 1, \ldots, a$.
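
The one-way decomposition for the dyestuff data can be reproduced directly from the yields in Table 5.4. The following is a minimal sketch with hypothetical variable names; it computes the sums of squares, mean squares, and $F$ statistic, along with the overall type I error for 15 pairwise tests under the simplifying assumption that the tests are independent. The p-value is omitted because it requires the $F$ distribution function, which NumPy does not provide.

```python
import numpy as np

# Table 5.4: rows are replicate observations 1-5, columns are batches 1-6
yields = np.array([
    [1545, 1540, 1595, 1445, 1595, 1520],
    [1440, 1555, 1550, 1440, 1630, 1455],
    [1440, 1490, 1605, 1595, 1515, 1450],
    [1520, 1560, 1510, 1465, 1635, 1480],
    [1580, 1495, 1560, 1545, 1625, 1445],
], dtype=float)

n, a = yields.shape                      # n replicates per batch, a batches
batch_means = yields.mean(axis=0)
grand_mean = yields.mean()

ss_between = n * np.sum((batch_means - grand_mean) ** 2)   # SS_1
ss_error = np.sum((yields - batch_means) ** 2)             # SS_2
ss_total = np.sum((yields - grand_mean) ** 2)              # SS_T

ms_between = ss_between / (a - 1)
ms_error = ss_error / (a * (n - 1))
f_stat = ms_between / ms_error

# Multiple testing: all pairwise comparisons of batches at level 0.05
n_pairs = a * (a - 1) // 2
familywise_error = 1 - 0.95 ** n_pairs

print(ss_between, ss_error, ss_total, ms_between, ms_error, f_stat)
print(n_pairs, familywise_error)
```

The printed sums of squares and mean squares agree with Table 5.6 after rounding to the nearest integer, and the between and error sums of squares add to the total, which is the decomposition on which the $F$ test relies.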

If  we  are  interested  in  the  overall  average  yield,  we  would  not  want  to  ignore 
batch  effects  if  present  (even  if  they  are  not  of  explicit  interest),  because  a  model 
with  no  batch  effects  would  not  allow  for  the  positive  correlations  that  are  induced 
between  yields  within  the  same  batch.  This  issue  is  discussed  in  far  greater  detail  in 
Chap.  8. 

Table 5.7 Data on clotting times (in minutes) for eight subjects, each of whom
receives four treatments

                      Treatment
Subject      1        2        3        4       Mean
1           8.4      9.4      9.8     12.2      9.95
2          12.8     15.2     12.9     14.4     13.82
3           9.6      9.1     11.2      9.8      9.92
4           9.8      8.8      9.9     12.0     10.12
5           8.4      8.2      8.5      8.5      8.40
6           8.6      9.9      9.8     10.9      9.80
7           8.9      9.0      9.2     10.4      9.38
8           7.9      8.1      8.2     10.0      8.55
Mean        9.30     9.71     9.94    11.02     9.99


5.8.2  Crossed  Designs 


We now consider two factors, which we label A and B, with $a$ and $b$ levels,
respectively. If each level of A is crossed with each level of B, we have a
factorial design. Suppose that there are $n$ replicates within each of the $ab$ cells.
The interaction model is

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk},$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$, and $k = 1, \ldots, n$. This model contains $1 + a + b + ab$
parameters, while the data supply only $ab$ sample means. Therefore, it is clear
that constraints on the parameters are required. In the corner-point parameterization
(Sect. 5.5.2), the $1 + a + b$ constraints are

$$\alpha_1 = \beta_1 = \gamma_{11} = \cdots = \gamma_{1b} = \gamma_{21} = \cdots = \gamma_{a1} = 0.$$

Alternatively, we may adopt the sum-to-zero constraints:

$$\sum_{i=1}^a \alpha_i = \sum_{j=1}^b \beta_j = \sum_{i=1}^a \gamma_{ij} = \sum_{j=1}^b \gamma_{ij} = 0.$$

Table  5.7  reproduces  data  from  Armitage  and  Berry  (1994)  in  which  clotting 
times  of  plasma  are  analyzed.  These  data  are  from  a  crossed  design  in  which  each 
of  a  =  8  subjects  received  b  =  4  treatments.  The  design  is  crossed  since  each 
patient  receives  each  of  the  treatments.  These  data  also  provide  an  example  of 
a  randomized  block  design  in  which  the  aim  is  to  provide  a  more  homogeneous 
experimental  setting  within  which  to  compare  the  treatments.  Ignoring  the  blocking 
factor  increases  the  unexplained  variability  and  reduces  efficiency.  Section  8.3 
provides  further  discussion. 

Table 5.8 ANOVA table for the two-way crossed classification with one observation per cell; DF
is short for degrees of freedom and EMS for the expected mean square

Source      Sum of squares                                                                                                       DF              EMS                                             F statistic
Factor A    $SS_A = b\sum_{i=1}^a(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$                                                     $a-1$           $\sigma^2 + b\sum_{i=1}^a\alpha_i^2/(a-1)$      $\frac{SS_A/(a-1)}{SS_E/[(a-1)(b-1)]}$
Factor B    $SS_B = a\sum_{j=1}^b(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2$                                                    $b-1$           $\sigma^2 + a\sum_{j=1}^b\beta_j^2/(b-1)$       $\frac{SS_B/(b-1)}{SS_E/[(a-1)(b-1)]}$
Error       $SS_E = \sum_{i=1}^a\sum_{j=1}^b(Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})^2$              $(a-1)(b-1)$    $\sigma^2$
Total       $SS_T = \sum_{i=1}^a\sum_{j=1}^b(Y_{ij} - \bar{Y}_{\cdot\cdot})^2$                                                    $ab-1$

Table 5.9 ANOVA table for the plasma clotting time data in Table 5.7; DF is short
for degrees of freedom. The quantity in brackets in the final column is the p-value

Source of variation    Sum of squares    DF    Mean square    F statistic
Treatment                   13.0          3       4.34        6.62 (0.0026)
Subjects                    79.0          7      11.3         17.2 ($2.2 \times 10^{-7}$)
Error                       13.8         21       0.656
Total                      105.8         31

There are no replicates within each of the 8 × 4 cells in Table 5.7, and so it is not
possible to examine interactions between subjects and treatments. Consequently, we
concentrate on the main-effects-only model:

$$ Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}, \tag{5.51} $$

for $i = 1, \dots, 4$; $j = 1, \dots, 8$ and with $\epsilon_{ij} \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$. Here we adopt
the corner-point parameterization with $\alpha_1 = 0$ and $\beta_1 = 0$. Table 5.8 gives the
generic ANOVA table for a two-way classification with no replicates, and Table 5.9
gives the numerical values for the plasma data. For these data, primary interest is in
treatment effects (the $\alpha_i$'s), and Table 5.9 shows the steps to obtaining a p-value of
0.0026 for the null of $H_0 : \alpha_2 = \alpha_3 = \alpha_4 = 0$ which, for this small sample size,
points strongly towards the null being unlikely. In passing, we note that there are
large between-subject differences for these data, so that the crossed design is very
efficient.
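
The sums of squares and F statistics for a crossed layout with one observation per cell can be computed directly from the cell values. The following is a minimal sketch, not the computation used for Table 5.9: the function name `two_way_anova` and the 3 × 4 data matrix are hypothetical, and p-values are omitted because they would need an F distribution function from a statistics library.

```python
def two_way_anova(y):
    """ANOVA for a crossed two-way layout with one observation per cell.

    y[i][j] is the response at level i of factor A and level j of factor B.
    """
    a, b = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (a * b)
    row_means = [sum(row) / b for row in y]
    col_means = [sum(y[i][j] for i in range(a)) / a for j in range(b)]

    ss_a = b * sum((m - grand) ** 2 for m in row_means)
    ss_b = a * sum((m - grand) ** 2 for m in col_means)
    ss_e = sum((y[i][j] - row_means[i] - col_means[j] + grand) ** 2
               for i in range(a) for j in range(b))
    ss_t = sum((y[i][j] - grand) ** 2 for i in range(a) for j in range(b))

    df_a, df_b, df_e = a - 1, b - 1, (a - 1) * (b - 1)
    ms_e = ss_e / df_e  # error mean square, the denominator of both F statistics
    return {
        "SS": {"A": ss_a, "B": ss_b, "E": ss_e, "T": ss_t},
        "DF": {"A": df_a, "B": df_b, "E": df_e, "T": a * b - 1},
        "F": {"A": (ss_a / df_a) / ms_e, "B": (ss_b / df_b) / ms_e},
    }


# Hypothetical 3 x 4 layout (these are NOT the plasma clotting data).
y = [[8.4, 9.1, 7.9, 10.2],
     [9.0, 9.8, 8.1, 10.9],
     [9.9, 10.1, 9.0, 11.6]]
table = two_way_anova(y)
```

The same ratio of mean squares links the columns of Table 5.9: for treatments, 4.34/0.656 ≈ 6.62.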

We now examine treatment differences using estimation. Under the improper
prior

$$ p(\mu, \alpha, \beta, \sigma^2) \propto \frac{1}{\sigma^2}, $$

interval estimates obtained from Bayesian, likelihood, and least squares analyses
are identical. We take a Bayesian stance and report the posterior distribution for
each of the treatment effects. We let $\theta = [\mu, \alpha^T, \beta^T]^T$ where $\alpha = [\alpha_2, \alpha_3, \alpha_4]^T$
and $\beta = [\beta_2, \dots, \beta_8]^T$. The joint posterior for $\theta$ is multivariate Student's t, with
$n - k - 1 = 32 - 11 = 21$ degrees of freedom, posterior mean $\hat{\theta}$ (the least
squares estimate) and posterior scale matrix $(x^T x)^{-1}\hat{\sigma}^2$, where $\hat{\sigma}^2$ is the usual
unbiased estimator of the residual error variance. Since treatment 1 is the reference,
we examine treatment differences with respect to this baseline group. Figure 5.6
gives the posterior distributions for $\alpha_2, \alpha_3, \alpha_4$. The posterior probabilities that the
average responses under treatments 2, 3, and 4 are greater than zero are 0.16,
0.065, and 0.00017, respectively. Consequently, we conclude that there is strong
evidence that treatment 4 differs from treatment 1, with progressively weaker evidence
of differences between treatment 1 and treatments 3 and 2.

Fig. 5.6 Marginal posterior distributions for the treatment contrasts, with treatment 1 as the baseline, for the plasma clotting time data in Table 5.7 (horizontal axis: treatment difference)


5.8.3  Nested  Designs 

For a design with two factors, suppose that $Y_{ijk}$ denotes a response at level $i$ of
factor A and level $j$ of factor B, with replication indexed by $k$. In a nested design,
in contrast to a crossed design, $j = 1$ in level 1 of factor A has no meaningful
connection with $j = 1$ in level 2 of factor A. In the context of the previous example,
suppose each of eight patients received a single treatment, but with $n$ replicate
measurements. In this case, we again have two factors, treatments and patients, but
the patient effects are nested within treatments. A nested model for two factors is

$$ Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk}, $$

with $i = 1, \dots, a$ indexing factor A, $j = 1, \dots, b$ indexing factor B, and
$k = 1, \dots, n$ indexing replicates. In the nested
patient/treatment example, A represents treatment and B patient, and so $\beta_{j(i)}$
represents the change in expected response for patient $j$ within level $i$ of treatment.
Notice that there is no interaction in the model, because factor B is nested within
factor A, and not crossed, and so there is no way of estimating the usual interactions.
In a sense, $\beta_{j(i)}$ is an interaction parameter since it is the patient effect specific to a
particular treatment. Table 5.10 gives the ANOVA table for this design.
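Table 5.10 ANOVA table for the two-way nested classification; DF is short for degrees of freedom and EMS for the expected mean square

Source              | Sum of squares | DF | EMS | F statistic
--------------------|----------------|----|-----|------------
Factor A            | $SS_A = bn\sum_{i=1}^{a}(\bar{Y}_{i\cdot\cdot}-\bar{Y}_{\cdot\cdot\cdot})^2$ | $a-1$ | $\sigma^2 + \frac{bn}{a-1}\sum_{i=1}^{a}\alpha_i^2$ | $\frac{SS_A/(a-1)}{SS_E/[ab(n-1)]}$
Factor B (within A) | $SS_B = n\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{Y}_{ij\cdot}-\bar{Y}_{i\cdot\cdot})^2$ | $a(b-1)$ | $\sigma^2 + \frac{n}{a(b-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\beta_{j(i)}^2$ | $\frac{SS_B/[a(b-1)]}{SS_E/[ab(n-1)]}$
Error               | $SS_E = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(Y_{ijk}-\bar{Y}_{ij\cdot})^2$ | $ab(n-1)$ | $\sigma^2$ |
Total               | $SS_T = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(Y_{ijk}-\bar{Y}_{\cdot\cdot\cdot})^2$ | $abn-1$ | |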
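
As an illustration of how the sums of squares in the nested classification decompose, here is a minimal sketch for balanced data. The function name `nested_anova` and the small data array are hypothetical. The layout `y[i][j][k]` holds replicate k for level j of factor B nested within level i of factor A.

```python
def nested_anova(y):
    """Sums of squares for a balanced two-way nested layout.

    y[i][j][k]: replicate k for level j of factor B within level i of factor A.
    """
    a, b, n = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / n for j in range(b)] for i in range(a)]
    level = [sum(cell[i]) / b for i in range(a)]   # means over B and replicates
    grand = sum(level) / a

    ss_a = b * n * sum((m - grand) ** 2 for m in level)
    ss_b = n * sum((cell[i][j] - level[i]) ** 2
                   for i in range(a) for j in range(b))
    ss_e = sum((v - cell[i][j]) ** 2
               for i in range(a) for j in range(b) for v in y[i][j])
    ss_t = sum((v - grand) ** 2
               for i in range(a) for j in range(b) for v in y[i][j])
    df = {"A": a - 1, "B": a * (b - 1), "E": a * b * (n - 1), "T": a * b * n - 1}
    return {"A": ss_a, "B": ss_b, "E": ss_e, "T": ss_t}, df


# Hypothetical data: a = 2 treatments, b = 3 patients per treatment, n = 2 replicates.
y = [[[5.1, 4.9], [6.0, 6.4], [5.5, 5.8]],
     [[7.2, 7.0], [6.9, 7.5], [8.1, 7.8]]]
ss, df = nested_anova(y)
```

The three component sums of squares add to the total, and the degrees of freedom add in the same way.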

5.8.4  Random  and  Mixed  Effects  Models 

The examples we have presented so far are known, in the frequentist literature, as
fixed effects ANOVA models since the parameters, for example, the $\alpha_i$'s in the
one-way classification, are viewed as nonrandom. An alternative random effects
approach is to view these parameters as a sample from a probability distribution,
with the usual choice being $\alpha_i \mid \sigma_\alpha^2 \sim_{iid} N(0, \sigma_\alpha^2)$. From a frequentist perspective,
the choice is based on whether the units that are selected can be viewed as being
a random sample from some larger distribution of effects. Often, patients in a trial
may be regarded as a random sample from some population, while treatment effects
may be regarded as fixed effects. In this case, we have a mixed effects model.
Model (5.51) was used for the data in Table 5.7 with the $\alpha_i$ and $\beta_j$ being treated as
fixed effects. Alternatively, we could use a mixed effects model with the individual
(subject) effects $\beta_j$ being treated as random effects and the $\alpha_i$, representing treatment effects,
being seen as fixed effects.

From a Bayesian perspective, the distinction between fixed and random effects is
less sharp since all unknowns are viewed as random variables. However, the prior
choice reflects the distinction. For example, in model (5.51), the “fixed effects”
corresponding to treatments may be assigned independent prior distributions $\alpha_i \sim N(0, V)$
where $V$ is fixed, while the “random effects” corresponding to patients may
be assigned the prior $\beta_j \mid \sigma_\beta^2 \sim_{iid} N(0, \sigma_\beta^2)$ with $\sigma_\beta^2$ assigned a prior and estimated
from the data.

A full description of estimation for random and mixed effects models will be
postponed until Chap. 8, though here we briefly describe likelihood-based inference
for the one-way model (5.48). Readers who have not previously encountered
random effects models may wish to skip the remainder of this section and return
after consulting Chap. 8. The one-way model is

$$ Y_{ij} = \mu + \alpha_i + \epsilon_{ij}, $$

where we have the usual assumption $\epsilon_{ij} \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$, $j = 1, \dots, n$, and add
$\alpha_i \mid \sigma_\alpha^2 \sim_{iid} N(0, \sigma_\alpha^2)$, $i = 1, \dots, a$, as the random effects distribution. We no longer
need a constraint on the $\alpha_i$'s in the random effects model since these parameters
are “tied together” via the normality assumption. A primary question of interest is
often whether there are between-unit differences, and this can be examined via the
hypothesis $H_0 : \sigma_\alpha^2 = 0$. In the one-way classification, this test turns out to be
equivalent to the F test given previously in Sect. 5.8.1, though this equivalence is
not true for more complex models. The ANOVA table given in Table 5.11 is very
similar to that for the fixed effects model form in Table 5.5, though we highlight the
difference in the final column.

Table 5.11 ANOVA table for test of $H_0 : \sigma_\alpha^2 = 0$; DF is short for degrees of freedom and EMS for the expected mean square

Source          | Sum of squares | DF | EMS
----------------|----------------|----|----
Between batches | $n\sum_{i=1}^{a}(\bar{Y}_{i\cdot}-\bar{Y}_{\cdot\cdot})^2$ | $a-1$ | $\sigma^2 + n\sigma_\alpha^2$
Error           | $\sum_{i=1}^{a}\sum_{j=1}^{n}(Y_{ij}-\bar{Y}_{i\cdot})^2$ | $a(n-1)$ | $\sigma^2$
Total           | $\sum_{i=1}^{a}\sum_{j=1}^{n}(Y_{ij}-\bar{Y}_{\cdot\cdot})^2$ | $an-1$ |

Estimation via a likelihood approach proceeds by integrating the $\alpha_i$ from the
model to give the marginal distribution

$$ p(y_i \mid \mu, \sigma^2, \sigma_\alpha^2) = \int p(y_i \mid \mu, \alpha_i, \sigma^2) \times p(\alpha_i \mid \sigma_\alpha^2)\, d\alpha_i, $$

and results in

$$ y_i \mid \mu, \sigma^2, \sigma_\alpha^2 \sim_{ind} N(\mu 1_n, \sigma^2 I_n + \sigma_\alpha^2 J_n), $$

where $y_i = [y_{i1}, \dots, y_{in}]^T$, $1_n$ is the $n \times 1$ vector of 1's, $I_n$ is the $n \times n$ identity matrix, and $J_n$ is the
$n \times n$ matrix of 1's. This likelihood can be maximized with respect to $\mu, \sigma^2, \sigma_\alpha^2$,
and asymptotic standard errors may be calculated from the information matrix. A
Bayesian approach combines the marginal likelihood with a prior $\pi(\mu, \sigma^2, \sigma_\alpha^2)$.
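
To make the marginal likelihood concrete, the sketch below evaluates it for balanced data with n observations per group, using the standard closed forms for the determinant and inverse of a matrix of the form $\sigma^2 I_n + \sigma_\alpha^2 J_n$. It also computes the closed-form maximizer that is available in the balanced case. The function names and the data are hypothetical, and this is an illustration rather than the general estimation procedure deferred to Chap. 8.

```python
import math


def marginal_loglik(groups, mu, sigma2, sigma2_alpha):
    """Log-likelihood after integrating out the random effects alpha_i."""
    total = 0.0
    for y in groups:
        n = len(y)
        lam = sigma2 + n * sigma2_alpha        # eigenvalue along the vector of 1's
        dev = [v - mu for v in y]
        s1 = sum(dev)
        s2 = sum(d * d for d in dev)
        # (y - mu 1)^T V^{-1} (y - mu 1) with V = sigma2 * I + sigma2_alpha * J
        quad = (s2 - sigma2_alpha / lam * s1 * s1) / sigma2
        logdet = (n - 1) * math.log(sigma2) + math.log(lam)
        total += -0.5 * (n * math.log(2 * math.pi) + logdet + quad)
    return total


def balanced_mle(groups):
    """Closed-form maximum likelihood estimates for balanced data."""
    a, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    mu = sum(means) / a
    ss_a = n * sum((m - mu) ** 2 for m in means)
    ss_e = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_e = ss_e / (a * (n - 1))
    if ss_a / a > ms_e:
        return mu, ms_e, (ss_a / a - ms_e) / n
    # Boundary case: the between-group variance estimate is zero.
    return mu, (ss_a + ss_e) / (a * n), 0.0


# Hypothetical data: a = 4 groups with n = 3 observations each.
groups = [[10.1, 9.8, 10.4], [12.0, 12.5, 11.7], [8.9, 9.4, 9.0], [11.1, 10.6, 11.0]]
mu_hat, sigma2_hat, sigma2_alpha_hat = balanced_mle(groups)
loglik_at_mle = marginal_loglik(groups, mu_hat, sigma2_hat, sigma2_alpha_hat)
```

For unbalanced data no such closed form exists in general, and the log-likelihood would be maximized numerically.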


5.9  Bias-Variance  Trade-Off 

Chapter  4  gave  an  extended  discussion  of  model  formulation  and  model  selection, 
and  the  example  at  the  end  of  Sect.  4.8  acted  as  a  prelude  to  this  section  in  which 
we  describe  the  bias-variance  trade-off  that  is  encountered  when  we  consider  which 
variables  to  include  in  a  model. 

Suppose  the  true  model  is 


$$ Y = x\beta + \epsilon, $$

where $Y$ is $n \times 1$, $x$ is $n \times (k+1)$, $\beta$ is $(k+1) \times 1$, and the errors are such that
$E[\epsilon] = 0$ and $\mathrm{var}(\epsilon) = \sigma^2 I_n$. We have seen that the estimator

$$ \hat{\beta} = (x^T x)^{-1} x^T Y $$

arises from ordinary least squares, likelihood (with normal errors, or large $n$),
and Bayesian (with normal errors and prior (5.42), or large $n$) considerations.
Asymptotically,

$$ (x^T x)^{1/2}(\hat{\beta}_n - \beta) \to_d N_{k+1}(0, \sigma^2 I_{k+1}), $$

where we assume $x^T x$ is of full rank. Since $x^T x$ is positive definite (all proper
variance-covariance matrices are positive definite), we can find a unique Cholesky
decomposition that is an upper-triangular matrix $U$ such that $(x^T x)^{-1} = UU^T$.
Proofs of the matrix results in this section may be found in Schott (1997, pp. 139-140).
This decomposition leads to

$$ \mathrm{var}(\hat{\beta}_j) = \sigma^2 \sum_{l=1}^{k+1} U_{jl}^2, $$

with $U_{jl} = 0$ if $j > l$.

We now split the collection of predictors into two groups, $x = [x_A, x_B]$, and
examine the implications of regressing on a subset of predictors. Let $\beta = [\beta_A, \beta_B]^T$
where $x_A$ is $n \times (q+1)$ with $q < k$ and $\beta_A$ is $(q+1) \times 1$. Now suppose we fit
the model

$$ E[Y \mid x_A, x_B] = x_A \beta_A^*, $$

where we distinguish between $\beta_A^*$ and $\beta_A$ since the interpretation of the two sets
of parameters differs. In particular, each coefficient in $\beta_A$ has an interpretation
as the linear association of the corresponding variable, controlling for all of the
other variables in $x$. For coefficients in $\beta_A^*$, control is only for variables in $x_A$. The
estimator in the reduced model is

$$ \hat{\beta}_A^* = (x_A^T x_A)^{-1} x_A^T Y $$

and

$$
\begin{aligned}
E[\hat{\beta}_A^*] &= (x_A^T x_A)^{-1} x_A^T E[Y] \\
&= (x_A^T x_A)^{-1} x_A^T (x_A \beta_A + x_B \beta_B) \\
&= \beta_A + (x_A^T x_A)^{-1} x_A^T x_B \beta_B,
\end{aligned}
\tag{5.52}
$$

so that the second term is the bias arising from omission of the last $k - q$ covariates.
This defines the quantity that is being consistently estimated by $\hat{\beta}_A^*$. An alternative,
less direct, derivation follows from the results of Sect. 2.4.3 in which we showed that
the Kullback-Leibler distance between the true model and the reduced (assumed)
model is that which is being minimized.

From (5.52), we see that the bias is zero if $x_A$ and $x_B$ are orthogonal, or if $\beta_B = 0$.
Consequently, for bias to result, we need $x_B$ to be associated with both the response
$Y$ and at least one of the variables in $x_A$. These requirements, roughly speaking, are
the conditions for $x_B$ to be considered a confounder. More precisely, Rothman and
Greenland (1998) give the following criteria for a confounder:

1.  A  confounding  variable  must  be  associated  with  the  response. 

2.  A  confounding  variable  must  be  associated  with  the  variable  of  interest  in  the 
population  from  which  the  data  are  sampled. 

3.  A  confounding  variable  must  not  be  affected  by  the  variable  of  interest  or 
the  response.  In  particular  it  cannot  be  an  intermediate  step  in  the  causal  path 
between  the  variable  of  interest  and  the  response. 

At  first  sight,  this  result  suggests  that  we  should  include  as  many  variables  as 
possible  in  the  mean  model,  since  this  will  reduce  bias.  But  the  splitting  of  the  mean 
squared  error  of  an  estimator  into  the  sum  of  the  squared  bias  and  the  variance 
shows  that  this  is  only  half  of  the  story.  Unfortunately,  including  variables  that  are 
not  associated  (or  have  a  weak  association  only)  with  Y  can  increase  the  variance 
of  the  estimator  (or  equivalently,  the  posterior  variance),  as  we  now  demonstrate. 
We write

$$ (x_A^T x_A)^{-1} = U_A U_A^T, $$

where $U_A$ is upper-triangular and consists of the first $q + 1$ rows and columns of $U$.
Denoting the $j$th element of the estimators from the reduced and full models as $\hat{\beta}_j^*$
and $\hat{\beta}_{Aj}$, respectively, we have

$$ \mathrm{var}(\hat{\beta}_j^*) = \sigma^2 \sum_{l=1}^{q+1} U_{jl}^2 \le \mathrm{var}(\hat{\beta}_{Aj}), $$

for $j = 0, 1, \dots, q$, with equality if and only if $x_A$ and $x_B$ are orthogonal.

Hence, if $\sigma^2$ is fixed across analyses, we conclude that adding covariates
decreases precision. Intuitively this is because there is only so much information
within a dataset, and if we add in variables that are related to $Y$ and are not
orthogonal to existing variables, the associations are not so accurately estimated
since there are now competing explanations for the data.

Another layer of complexity is added when we take into account estimation of
$\sigma^2$ since the estimated standard errors of the estimator now depend on $\hat{\sigma}^2$. The usual
unbiased estimator is given by the residual sum of squares divided by the degrees of
freedom. The former is nonincreasing as covariates are added to the model, and the
latter is decreasing. Consequently, as variables are entered into the model in order
of their “significance,” a typical pattern is for $\hat{\sigma}^2$ to decrease with the addition of
important covariates, with an increase then occurring as variables that are almost
unrelated are added (due to the decrease in the denominator of the estimator).

To expand on this further, consider the “true” model in which we assume for
simplicity that $\beta_B$ is univariate so that $x_B$ is $n \times 1$:

$$ Y = x_A \beta_A + x_B \beta_B + \epsilon, $$

where $E[\epsilon] = 0$ and $\mathrm{var}(\epsilon) = \sigma^2 I_n$. We now fit the model

$$ Y = x_A \beta_A^* + \epsilon^*, $$

so that $x_B$ is omitted. Then, viewing $X_B$ as random (since it is unobserved), we
obtain

$$ \mathrm{var}(Y \mid x_A) = \sigma^2 I_n + \beta_B^2\, \mathrm{var}(X_B \mid x_A), $$

showing the form of the increase in residual variance (unless $\beta_B = 0$) when variables
related to the response are omitted from the model. If $x_A$ and $x_B$ are collinear, the
variance of $X_B$ does not depend on $x_A$.

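We expand on the development of this section, with a slight change of notation,
via the “true” model

$$ Y_i = \beta_0 + \beta_A (x_i - \bar{x}) + \beta_B (z_i - \bar{z}) + \epsilon_i $$

and fitted model

$$ Y_i = \beta_0^* + \beta_A^* (x_i - \bar{x}) + \epsilon_i^*. $$

Then, $\hat{\beta}_0 = \hat{\beta}_0^* = \bar{Y}$ (since the covariates are centered in each model), and so each
is an unbiased estimator of the intercept:

$$ E[\hat{\beta}_0] = E[\hat{\beta}_0^*] = \beta_0. $$

From (5.52),

$$ E[\hat{\beta}_A^*] = \beta_A + \beta_B \times \frac{S_{xz}}{S_{xx}} = \beta_A + \beta_B R_{xz} \left( \frac{S_{zz}}{S_{xx}} \right)^{1/2}, \tag{5.53} $$

where

$$ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{xz} = \sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z}), \quad S_{zz} = \sum_{i=1}^{n} (z_i - \bar{z})^2, $$

and

$$ R_{xz} = \frac{S_{xz}}{(S_{xx} S_{zz})^{1/2}}. $$

We have seen (5.53) before in a slightly different form, namely (5.11) in the context
of confounding. In the full model we have

$$ (x^T x)^{-1} = \begin{bmatrix} 1/n & 0 & 0 \\ 0 & S_{zz}/D & -S_{xz}/D \\ 0 & -S_{xz}/D & S_{xx}/D \end{bmatrix}, $$

where $D = S_{xx} S_{zz} - S_{xz}^2$, giving

$$ \mathrm{var}(\hat{\beta}_A^*) = \frac{\sigma^2}{S_{xx}} \le \frac{\sigma^2 S_{zz}}{S_{xx} S_{zz} - S_{xz}^2} = \mathrm{var}(\hat{\beta}_A), $$

with equality if and only if $S_{xz} = 0$ (so that $X$ and $Z$ are orthogonal), assuming
that $\sigma^2$ is known.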
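
The two-covariate case can be checked numerically. In the sketch below, which uses hypothetical covariate values and coefficients, the responses are generated without noise so that the fitted slope in the reduced model equals its expectation exactly. The slope then matches $\beta_A + \beta_B S_{xz}/S_{xx}$, and for known $\sigma^2$ the slope variance in the reduced model is no larger than in the full model.

```python
def centered_sums(x, z):
    """Return S_xx, S_zz and S_xz for two covariates."""
    n = len(x)
    xb, zb = sum(x) / n, sum(z) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    szz = sum((zi - zb) ** 2 for zi in z)
    sxz = sum((xi - xb) * (zi - zb) for xi, zi in zip(x, z))
    return sxx, szz, sxz


def reduced_slope(x, y):
    """Least squares slope from regressing y on x alone (z omitted)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical covariates and coefficients; noise-free responses.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.0, 0.5, 2.5, 2.0, 4.5, 3.0]
beta_0, beta_a, beta_b = 1.0, 2.0, -1.5
y = [beta_0 + beta_a * xi + beta_b * zi for xi, zi in zip(x, z)]

sxx, szz, sxz = centered_sums(x, z)
expected_slope = beta_a + beta_b * sxz / sxx      # right-hand side of (5.53)
fitted_slope = reduced_slope(x, y)

sigma2 = 1.0                                      # treated as known
var_reduced = sigma2 / sxx
var_full = sigma2 * szz / (sxx * szz - sxz ** 2)
```

Here x and z are positively associated and $\beta_B$ is negative, so the reduced-model slope is biased downward, while its variance is smaller than that of the full-model slope.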

When  deciding  upon  the  number  of  covariates  for  inclusion  in  the  mean  model, 
there  are  therefore  competing  factors  to  consider.  The  bias  in  the  estimator  cannot 
increase  as  more  variables  are  added,  but  the  precision  of  the  estimator  may  increase 
or  decrease,  depending  on  the  strength  of  the  associations  of  the  variables  that  are 
candidates for inclusion. The unexplained variation in the data (measured through
$\sigma^2$) may be reduced, but the uncertainty in which of the covariates to assign the
variation in the response to is increased. If the number of potential additional
variables is large, the loss of precision may be considerable.

Section  4.8  described  and  critiqued  various  approaches  to  variable  selection, 
emphasizing  that  the  strategy  taken  is  highly  dependent  on  the  context  and  in 
particular  whether  the  aim  is  exploratory,  confirmatory,  or  predictive.  Chapter  12 
considers  the  latter  case  in  detail. 


Example:  Prostate  Cancer 

In this section we briefly illustrate the ideas of the previous section using two
covariates from the PSA dataset, log(can vol), which we denote $x_1$, and log(cap pen),
which we denote $x_2$. Let $x = [x_1, x_2]$ and recall $Y$ is log(PSA). Figure 5.7(a)
plots $x_2$ versus $x_1$ and illustrates the strong association between these variables.
Figure 5.7(b) plots $Y$ versus $x_2$, and we see an association here too. We obtain the
following estimates:

$$
\begin{aligned}
E[Y \mid x_1] &= \beta_0^* + \beta_1^* x_1 && (5.54) \\
&= 1.51 + 0.72 \times x_1 && (5.55) \\
E[Y \mid x] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 && (5.56) \\
&= 1.61 + 0.66 \times x_1 + 0.080 \times x_2 && (5.57) \\
E[x_2 \mid x_1] &= a + b x_1 \\
&= -12.6 + 0.80 \times x_1 && (5.58)
\end{aligned}
$$

Fig. 5.7 (a) Association between log capsular penetration and log cancer volume, with fitted line; (b) association between log prostate-specific antigen and log capsular penetration, with fitted line

We first confirm, using (5.12) and (5.11), that the estimate associated with log(can
vol) in model (5.54) combines the effect of this variable and log(cap pen):

$$ \hat{\beta}_1^* = \hat{\beta}_1 + b \times \hat{\beta}_2 = 0.66 + 0.80 \times 0.08 = 0.72, $$

with $b$ from (5.58), to give the estimate appearing in (5.55). The standard error
associated with $x_1$ in model (5.54) is 0.068, while in the full model (5.56), it
increases to 0.092 due to the association observed in Fig. 5.7a between $x_1$ and $x_2$.


5.10  Robustness  to  Assumptions 


In this section we investigate the behavior of the estimator
$\hat{\beta} = (x^T x)^{-1} x^T Y$ under departures from the assumptions that lead to

$$ (x^T x)^{1/2}(\hat{\beta}_n - \beta) \to_d N_{k+1}(0_{k+1}, \sigma^2 I_{k+1}). $$

Correct  inference  arises  from  normality  of  the  estimator,  and  the  error  terms  should 
have  constant  variance  and  absence  of  correlation.  Normality  of  the  estimator  occurs 
with  a  sufficiently  large  sample  size  or  if  the  error  terms  are  normal.  Judging  when 
the  sample  size  is  large  enough  can  be  assessed  through  simulation,  and  there  is 
an  interplay  between  sample  size  and  the  closeness  of  the  error  distribution  to 
normality.  We  present  results  examining  the  effect  of  departures  on  confidence 
interval  coverage,  but  these  are  identical  to  Bayesian  credible  intervals  under  the 
improper prior (5.42). Regardless of the distribution of the errors and the
mean-variance relationship, we always obtain an unbiased estimator, hence the emphasis
on confidence interval coverage.


5.10.1  Distribution  of  Errors 

We begin by examining the effect of non-normality of the errors and simulate data
from a linear model with errors that are uncorrelated and with constant variance.
The distribution of the errors is taken as either normal, Laplacian, Student's t
with 3 degrees of freedom, or lognormal. We examine the behavior of the least
squares estimator for $\beta_1$, with $n = 5$ and $n = 20$, and two distributions for the
covariate, either $x_i \sim_{iid} U(0, 1)$ or $x_i \sim_{iid} \mathrm{Ga}(1, 1)$ (an exponential distribution),
for $i = 1, \dots, n$. The latter was chosen to examine the effects of a skewed covariate
distribution.

Table 5.12 presents the 95% confidence interval coverage for $\beta_1$, based on 10,000
simulations; the true value is $\beta_1 = 0$. For the normal error distribution, the coverage
should be exactly 95%, but we include simulation-based results to give an indication
of the Monte Carlo error. In all cases the coverage probabilities are good, showing
the robustness of inference in this simple scenario. When the number of covariates,
$k$, is large relative to $n$, more care is required, especially if the distributions of the
covariates are very skewed. Lumley et al. (2002) discuss the validity of the least
squares estimator when the data are not normal.
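
A coverage experiment of this kind can be sketched in a few lines. The code below is not the simulation behind Table 5.12. It is a simplified version under stated assumptions: a uniform covariate only, normal and Laplacian errors only, 4,000 replicates rather than 10,000, true intercept and slope both zero, and hard-coded Student's t quantiles in place of a statistics library. All function names are hypothetical.

```python
import math
import random

# 97.5% Student's t quantiles with n - 2 degrees of freedom, keyed by n.
T_QUANTILE = {5: 3.182446, 20: 2.100922}


def slope_ci_covers(x, y, true_slope, tq):
    """True if the usual 95% interval for the slope contains true_slope."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    b0 = yb - b1 * xb
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return abs(b1 - true_slope) <= tq * se


def laplace(rng):
    # The difference of two independent unit exponentials is Laplace(0, 1).
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def coverage(n, error_draw, n_sim=4000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = [rng.random() for _ in range(n)]         # uniform covariate
        y = [error_draw(rng) for _ in range(n)]      # beta_0 = beta_1 = 0
        hits += slope_ci_covers(x, y, 0.0, T_QUANTILE[n])
    return hits / n_sim


cov_normal = coverage(20, lambda rng: rng.gauss(0.0, 1.0))
cov_laplace = coverage(20, laplace)
```

With 4,000 replicates the Monte Carlo standard error of an estimated coverage near 0.95 is about 0.003, so estimates within a percentage point or so of 95% are compatible with nominal coverage.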


5.10.2  Nonconstant  Variance 

We have already considered the robustness of inference to nonconstant error
variance in Sect. 5.6.4, in the context of sandwich estimation. Table 5.2 showed
that confidence interval coverage will be poor when an incorrect mean-variance
relationship is assumed. Sandwich estimation provides a good frequentist alternative
estimation strategy, so long as the sample size is large enough for the variance of
the estimator to be reliably estimated. The bootstrap (Sect. 2.7) provides another
method for reliable variance estimation, again when the sample size is not small.

Table 5.12 Coverage of 95% confidence intervals for $\beta_1$ for various error distributions, distributions of the covariate, and sample sizes $n$. The entries are based on 10,000 simulations

Error distribution   | Distribution of x | n  | Coverage (%)
---------------------|-------------------|----|-------------
Normal N(0, 1)       | Uniform           | 5  | 95
Normal N(0, 1)       | Uniform           | 20 | 94
Normal N(0, 1)       | Exponential       | 5  | 95
Normal N(0, 1)       | Exponential       | 20 | 95
Laplacian Lap(0, 1)  | Uniform           | 5  | 95
Laplacian Lap(0, 1)  | Uniform           | 20 | 95
Laplacian Lap(0, 1)  | Exponential       | 5  | 94
Laplacian Lap(0, 1)  | Exponential       | 20 | 95
Student T(0, 1, 3)   | Uniform           | 5  | 95
Student T(0, 1, 3)   | Uniform           | 20 | 95
Student T(0, 1, 3)   | Exponential       | 5  | 95
Student T(0, 1, 3)   | Exponential       | 20 | 95
Lognormal LN(0, 1)   | Uniform           | 5  | 95
Lognormal LN(0, 1)   | Uniform           | 20 | 96
Lognormal LN(0, 1)   | Exponential       | 5  | 94
Lognormal LN(0, 1)   | Exponential       | 20 | 95


5.10.3  Correlated  Errors 


Finally we investigate the effect on coverage of correlated error terms. A simple
scenario to imagine is $(x, y)$ pairs collected on consecutive days. We assume an
autoregressive model of order 1, AR(1) (Sect. 8.4.2), which results in $\epsilon \mid \sigma^2 \sim
N(0_n, \sigma^2 V)$, where $V$ is the $n \times n$ matrix

$$ V = \begin{bmatrix} 1 & \rho & \cdots & \rho^{n-1} \\ \rho & 1 & \cdots & \rho^{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \cdots & 1 \end{bmatrix}, $$

and with $\rho$ the correlation between errors on successive days. Table 5.13 gives
the 95% confidence interval coverage (arising from a model in which the errors
are assumed uncorrelated) as a function of sample size, the distribution of $x$
(uniform or exponential), and strength of correlation. The table clearly shows
that correlated errors can drastically impact confidence interval coverage, with the
coverage becoming increasingly poor as the sample size increases.
Table 5.13 95% confidence interval coverage for the slope parameter $\beta_1$ as a function of the autocorrelation parameter $\rho$ and the sample size $n$. The entries are based upon 10,000 simulations and are calculated under a model in which the errors are assumed uncorrelated

Distribution of x | Correlation ρ | n  | Coverage (%)
------------------|---------------|----|-------------
Uniform           | 0.1           | 5  | 94
Uniform           | 0.1           | 20 | 93
Uniform           | 0.1           | 50 | 92
Uniform           | 0.5           | 5  | 89
Uniform           | 0.5           | 20 | 76
Uniform           | 0.5           | 50 | 75
Uniform           | 0.95          | 5  | 79
Uniform           | 0.95          | 20 | 36
Uniform           | 0.95          | 50 | 26
Exponential       | 0.1           | 5  | 94
Exponential       | 0.1           | 20 | 93
Exponential       | 0.1           | 50 | 93
Exponential       | 0.5           | 5  | 89
Exponential       | 0.5           | 20 | 79
Exponential       | 0.5           | 50 | 77
Exponential       | 0.95          | 5  | 81
Exponential       | 0.95          | 20 | 41
Exponential       | 0.95          | 50 | 32

Intuitively, one might expect that in this situation the standard errors based on
$(x^T x)^{-1}\hat{\sigma}^2$ would always underestimate the true standard error of the estimator.
In the scenario described above, the effect of correlated errors depends critically
upon the correlation among the $x$ variables across time, however. If the $x$-variable is
slowly varying over time, then the standard errors will be underestimated, but if the
variable is changing rapidly, then the true standard errors may be smaller than those
reported. This is because if there is high positive correlation, then the difference in
the error terms on consecutive days is small, and so if $Y$ changes, it must be due to
changes in $x$. For further discussion, see Sect. 8.3.
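
The loss of coverage under positively correlated errors can be reproduced with a short simulation. The sketch below is an illustration under stated assumptions, not the simulation behind Table 5.13. It uses stationary AR(1) errors with unit marginal variance, a uniform covariate that is sorted so that it varies slowly over time, n = 20, 4,000 replicates, a true slope of zero, and a hard-coded Student's t quantile. The function names are hypothetical.

```python
import math
import random

T975_DF18 = 2.100922   # 97.5% Student's t quantile, 18 degrees of freedom


def ar1_errors(n, rho, rng):
    """Stationary AR(1) errors, so that corr(e_i, e_j) = rho ** |i - j|."""
    e = [rng.gauss(0.0, 1.0)]
    innov_sd = math.sqrt(1.0 - rho * rho)
    for _ in range(n - 1):
        e.append(rho * e[-1] + innov_sd * rng.gauss(0.0, 1.0))
    return e


def ols_coverage(rho, n=20, n_sim=4000, seed=2):
    """Coverage of the usual 95% slope interval, which ignores the correlation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # Assumption: the covariate is ordered in time (slowly varying).
        x = sorted(rng.random() for _ in range(n))
        y = ar1_errors(n, rho, rng)                  # true slope is zero
        xb, yb = sum(x) / n, sum(y) / n
        sxx = sum((xi - xb) ** 2 for xi in x)
        b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
        rss = sum((yi - yb - b1 * (xi - xb)) ** 2 for xi, yi in zip(x, y))
        se = math.sqrt(rss / (n - 2) / sxx)
        hits += abs(b1) <= T975_DF18 * se
    return hits / n_sim


cov_independent = ols_coverage(0.0)
cov_correlated = ols_coverage(0.95)
```

Replacing the sorted covariate with an unsorted one gives a rapidly varying $x$, which is the other case discussed above.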


5.11  Assessment  of  Assumptions 


In this section we will describe a number of approaches for assessing the
assumptions required for valid inference.


5.11.1  Review  of  Assumptions 

We  consider  the  linear  regression  model: 


$$ Y = x\beta + \epsilon, $$

where $Y$ is $n \times 1$, $x$ is $n \times (k+1)$, $\beta$ is $(k+1) \times 1$, and $\epsilon$ is $n \times 1$, with
$\epsilon \mid \sigma^2 \sim N_n(0, \sigma^2 I_n)$. Under these assumptions, we have seen that the estimator
$\hat{\beta} = (x^T x)^{-1} x^T Y$, with $\mathrm{var}(\hat{\beta}) = (x^T x)^{-1}\sigma^2$, emerges from likelihood, least
squares, and Bayesian approaches. The standard errors and confidence intervals we
report are valid if:

•  The error terms have constant variance. If sandwich estimation is used, then this
assumption may be relaxed, so long as we have a large sample size.

•  The error terms are uncorrelated.

•  The estimator is normally distributed, so that we can effectively replace the
likelihood $Y \mid \beta, \sigma^2$ by $\hat{\beta} \mid \beta \sim N_{k+1}[\beta, (x^T x)^{-1}\sigma^2]$. This occurs if the error
terms are normally distributed and/or the sample size $n$ is sufficiently large for
the central limit theorem to ensure that the estimator is normally distributed.

As  we  saw  in  Sect.  5.10,  confidence  interval  coverage  can  be  very  poor  if  the  error 
variance  is  nonconstant  and/or  the  errors  are  correlated.  Normality  of  errors  is  not 
a  big  issue  with  the  linear  model  with  respect  to  estimation  (which  explains  the 
popularity  of  least  squares),  unless  the  sample  size  is  very  small  (relative  to  the 
number  of  predictors)  or  the  distribution  of  the  x  values  is  very  skewed.  For  validity 
of  a  predictive  interval  for  an  observable,  we  need  to  make  a  further  assumption 
concerning  the  distribution  of  the  error  terms,  however.  This  interval  is  given 
by  (5.30)  under  the  assumption  of  normal  errors. 

From a frequentist perspective and given the assumed mean model, $E[Y \mid x] =
x\beta$, the estimator $\hat{\beta}$ is an unbiased estimator of $\beta$. For example, in simple linear
regression, $\hat{\beta}_1$ is an unbiased estimator of the linear association in a population,
regardless  of  the  true  relationship  between  response  and  covariate.  The  assumed 
mean  model  may  be  a  poor  description,  however,  and  we  will  usually  wish  to 
examine  the  appropriateness  of  the  model  to  decide  on  whether  linearity  holds. 

Another  aspect  of  model  checking  is  scrutinizing  the  data  for  outlying  or 
influential  points.  It  is  difficult  to  define  exactly  what  is  meant  by  an  outlier,  and 
we  content  ourselves  with  a  fuzzy  description  of  an  outlier  as  “a  data  point  that  is 
unusual  relative  to  the  others.”  Single  outlying  observations  may  stand  out  in  the 
plots  described  below.  The  presence  of  multiple  outliers  is  more  troublesome  due  to 
masking,  in  which  the  presence  of  an  outlier  is  hidden  by  other  outliers. 


5.11.2  Residuals  and  Influence 

In general, model checking may be carried out locally, using informal techniques
such as residual plots, or globally using formal testing procedures; we concentrate
on the former. The observed error is given by

$$ e_i = Y_i - \hat{Y}_i, \tag{5.59} $$

where $\hat{Y}_i = x_i \hat{\beta}$, while the true error is

$$ \epsilon_i = Y_i - E[Y_i \mid x_i]. $$

In residual analysis we examine the observed residuals for discrepancies from the
assumed model. We define residuals as

$$ e = [e_1, \dots, e_n]^T = Y - \hat{Y} = (I_n - h)Y, \tag{5.60} $$

where $h = x(x^T x)^{-1} x^T$ is the hat (or projection) matrix encountered in Sect. 5.6.3.
The hat matrix is symmetric, $h^T = h$, and idempotent, $hh^T = h$. We want to
examine the relationship between $e$ and $\epsilon$ so we can use the former to assess whether
assumptions concerning the latter hold.

Substitution of

$$ Y = x\beta + \epsilon $$

into (5.60) gives, since $(I_n - h)x = 0$,

$$ e = (I_n - h)\epsilon, \tag{5.61} $$

or

$$ e_i = \epsilon_i - \sum_{j=1}^{n} h_{ij} \epsilon_j, \tag{5.62} $$

showing that the estimated residuals differ from the true errors, complicating
residual analysis.

We examine the moments of the residuals. The residuals $e$ are random variables
since they are a function of the random variables $\epsilon$. We have

$$ E[e] = (I_n - h)E[\epsilon] = 0_n $$

and the variance-covariance matrix is

$$ \mathrm{var}(e) = (I_n - h)(I_n - h)^T \sigma^2 = (I_n - h)\sigma^2, $$

so that fitting the model has induced dependence in the residuals. In particular,

$$ \mathrm{var}(e_i) = (1 - h_{ii})\sigma^2, $$

since for a symmetric and idempotent matrix $h_{ii} = \sum_{j=1}^{n} h_{ij}^2$ (see Schott 1997,
p. 374), and

$$ \mathrm{cov}(e_i, e_j) = -h_{ij}\sigma^2, $$

showing that the observed errors have correlation given by

$$ \mathrm{corr}(e_i, e_j) = -\frac{h_{ij}}{[(1 - h_{ii})(1 - h_{jj})]^{1/2}}. $$

Consequently, even if the model is correctly specified, the residuals have nonconstant
variance and are correlated. We may write

$$ \hat{Y}_i = h_{ii} Y_i + \sum_{j=1, j \ne i}^{n} h_{ij} Y_j \tag{5.63} $$

so that if $h_{ii}$ is large relative to the other elements in the $i$th row of $h$, then the $i$th
fitted value will be largely influenced by $Y_i$; $h_{ii}$ is known as the leverage. Note that
the leverage depends on the design matrix (i.e., the $x$'s) only. Exercise 5.8 shows that
$\mathrm{tr}(h) = k + 1$ so the average leverage is $(k+1)/n$. If $h_{ii} = 1$, $y_i = x_i\hat{\beta}$ and
the $i$th observation is fitted exactly, using a single degree of freedom for this point
alone, which is not desirable.

Based  on  these  results  we  may  define  standardized  residuals: 


Y,  -  Y, 


e,  = 


(5.64) 


for  i  =  1 , ...  ,n,  and  where  a  is  an  unbiased  estimator  of  a.  These  residuals  have 
mean  E[cre*]  =  0  and  variance  var(ire*)  =  cr2,  but  they  are  not  independent  since 
they  are  based  on  n  —  k  —  1  independent  quantities.  Often  the  (1  —  ha)1/2  terms  in 
the  denominator  of  (5.64)  are  ignored. 

For the simple linear regression model,

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2}$$

and

$$h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^{n}(x_k - \bar{x})^2}.$$

Therefore, with respect to (5.63), we see that an extreme $x_i$ value produces a fitted
value $\hat{Y}_i$ that is more heavily influenced by the observed value of $Y_i$. Such $x_i$ values
also influence other fitted values, particularly those with $x$ values not close to $\bar{x}$. The
two constraints on the model are

$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - \hat{Y}_i) = 0$$

$$\sum_{i=1}^{n} x_i e_i = \sum_{i=1}^{n} x_i (Y_i - \hat{Y}_i) = 0$$

which induces correlation in the $e_i$'s.
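
The leverage formulas for simple linear regression are easy to check numerically. The following is a minimal sketch in Python (chosen here for illustration only; it is not a language used in the text). The function name `hat_matrix` and the data values are hypothetical, with the last covariate value deliberately extreme.

```python
def hat_matrix(x):
    """Hat matrix for simple linear regression: h_ij = 1/n + (x_i - xbar)(x_j - xbar)/Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xk - xbar) ** 2 for xk in x)
    return [[1.0 / n + (xi - xbar) * (xj - xbar) / sxx for xj in x] for xi in x]


# Hypothetical data: the last x value is far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.5]

h = hat_matrix(x)
leverage = [h[i][i] for i in range(len(x))]

# Fitted values are linear in the responses: Yhat_i = sum_j h_ij Y_j.
fitted = [sum(h[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]
resid = [yi - fi for yi, fi in zip(y, fitted)]

print("leverages:", [round(v, 3) for v in leverage])
print("trace of h:", round(sum(leverage), 6))
```

In this sketch the trace equals $k + 1 = 2$, the extreme point has by far the largest leverage, and the residuals satisfy the two constraints (they sum to zero and are orthogonal to $x$), up to rounding error.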



5.11.3  Using  the  Residuals 

The constancy of variance assumption may be assessed by plotting the residuals $e_i^*$
versus the fitted values $\hat{Y}_i$, with a random scatter suggesting no cause for concern.
Examination may be simpler if squared residuals $(e_i^*)^2$ or absolute values of the
residuals $|e_i^*|$ are plotted versus the fitted values $\hat{Y}_i$. These plots are useful since
departures from constant variance often correspond to a mean-variance relationship
which, given sufficient data and range of the mean function, will hopefully reveal
itself in these plots. If the variance increases with the mean, plotting $e_i^*$ versus $\hat{Y}_i$
will reveal a funnel shape with the wider end of the funnel to the right of the plot.
For the plots using the squared or absolute residuals, interpretation may be improved
with the addition of a smoother.

When  one  of  the  columns  of  x  represents  time,  we  may  plot  the  residuals  versus 
time  and  assess  dependence  between  error  terms.  Dependence  may  also  be  detected 
using scatterplots of lagged residuals, for example, by plotting $e_i$ versus $e_{i-1}$ for
$i = 2, \ldots, n$. Independent residuals should produce a plot with a random scatter
of  points.  The  autocorrelation  at  different  lags  may  also  be  estimated  for  equally 
spaced  data  in  time,  while  for  unequally  spaced  data,  a  semi-variogram  may  be 
constructed.  The  latter  is  described  in  the  context  of  longitudinal  data  in  Sect.  8.8. 
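
For equally spaced data, the autocorrelation at a given lag can be estimated directly from the residuals. The sketch below is one common sample estimator, written in Python for illustration; the function name `lag_autocorrelation` and the two residual series are hypothetical, and other normalizations of the estimator are also in use.

```python
def lag_autocorrelation(e, lag=1):
    """Sample autocorrelation of the sequence e at the given lag."""
    n = len(e)
    mean = sum(e) / n
    denom = sum((v - mean) ** 2 for v in e)
    num = sum((e[i] - mean) * (e[i - lag] - mean) for i in range(lag, n))
    return num / denom


# Hypothetical residual series, ordered in time.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # strong negative dependence
trending = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]      # strong positive dependence

r_alt = lag_autocorrelation(alternating)
r_trend = lag_autocorrelation(trending)
print(r_alt, r_trend)
```

Values near zero are what one would hope to see for independent errors, bearing in mind that fitting the model itself induces some correlation in the residuals.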

To assess normality of the residuals, we may construct a normal QQ plot. We
first order the residuals and call these $e_{(i)}$, $i = 1, \ldots, n$. The expected order statistic
of size $n$ from a normal distribution is given (approximately) by

$$f_{(i)} = \Phi^{-1}\left(\frac{i - 0.5}{n}\right), \quad i = 1, \ldots, n,$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal
distribution, that is, if $Z \sim N(0, 1)$ then $\Phi(z) = \Pr(Z < z)$. We then plot $e_{(i)}$ versus
$f_{(i)}$. If the normality assumption is reasonable, the points should lie approximately
on a straight line. If we plot the ordered standardized residuals $e^*_{(i)}$ versus $f_{(i)}$,
then, in addition, the line should have slope one. Deciding on whether the points
are suitably close to linear is difficult and may be aided by simulating multiple
datasets from which intervals may be derived for each $i$. Care must be taken in
interpretation as (5.62) shows that the observed residuals are a linear combination
of the error terms and hence may exhibit supernormality, that is, even if $\epsilon_i$ is not
normal, $\sum_{j \neq i} h_{ij}\epsilon_j$ may tend toward normality (and dominate the first term, $\epsilon_i$).
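
The coordinates of a normal QQ plot can be computed with a few lines of code. The following is a minimal sketch using only the Python standard library (`statistics.NormalDist.inv_cdf` plays the role of $\Phi^{-1}$); the function names and the residual values are hypothetical.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate expected normal order statistics, Phi^{-1}((i - 0.5)/n), i = 1..n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def qq_points(residuals):
    """Pair each expected order statistic with the matching ordered residual."""
    ordered = sorted(residuals)
    return list(zip(normal_scores(len(ordered)), ordered))


# Hypothetical residuals.
residuals = [0.3, -1.2, 0.8, -0.1, 1.5]
scores = normal_scores(len(residuals))
points = qq_points(residuals)
for expected, observed in points:
    print(round(expected, 3), observed)
```

Plotting the second coordinate against the first gives the QQ plot; repeating the calculation on datasets simulated under the fitted model is one way to obtain the reference intervals mentioned above.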

Figure 5.8 shows what we might expect to see under various distributional
assumptions. QQ normal plots for normal, Laplacian, Student's $t_3$, and lognormal
error distributions are displayed in the four rows, with sample sizes of $n =
10, 25, 50, 200$ across columns. The characteristic skewed shape of the lognormal
distribution is revealed for all sample sizes, but it is difficult to distinguish
between the Laplacian and the normal, even for a large sample size. For small $n$,
interpretation is very difficult.

Fig. 5.8 Normal scores plot for various distributions and sample sizes. Columns 1-4 represent
sample sizes of 10, 25, 50, and 200. Rows 1-4 correspond to errors generated from normal,
Laplacian, Student's $t_3$, and lognormal distributions, respectively. In each plot, the expected
residuals are plotted on the x-axis, and the observed ordered residuals on the y-axis


In general, simulation may be used to examine the behavior of plots when
the model is true. QQ plots may be constructed to assess any distributional
assumption, by an appropriate choice of $f_{(i)}$. The Bayesian approach to inference
allows alternative likelihoods to the normal to be fitted relatively easily under an
MCMC implementation. We have concentrated on frequentist residuals, but all of
the above plots may be based on Bayesian residuals. For example, we can obtain
samples from the posterior distribution of $\beta$ and $\sigma$ and then substitute these samples
into

$$e_i^* = \frac{y_i - x_i\beta}{\sigma(1 - h_{ii})^{1/2}}, \qquad (5.65)$$

to produce samples from the posterior distribution of the residuals. The posterior
mean or median of the $e_i^*$ can then be calculated and examined. More simply, one
could substitute the posterior means or medians of $\beta$ and $\sigma$ into (5.65). An early use
of Bayesian residual analysis was provided by Chaloner and Brant (1988).

Table 5.14 Parameter estimates and standard errors (model-based and sandwich)
for the prostate cancer data

| Variable | Estimate | Standard error (model-based) | Standard error (sandwich) |
|---|---|---|---|
| log(can vol) | 0.59 | 0.088 | 0.077 |
| log(weight) | 0.45 | 0.17 | 0.19 |
| age | -0.020 | 0.011 | 0.0094 |
| log(BPH) | 0.11 | 0.058 | 0.057 |
| SVI | 0.77 | 0.24 | 0.21 |
| log(cap pen) | -0.11 | 0.091 | 0.079 |
| gleason | 0.045 | 0.16 | 0.13 |
| PGS45 | 0.0045 | 0.0044 | 0.0042 |
| $\sigma$ | 0.78 | - | - |

A  major  problem  with  residual  analysis,  unless  one  is  in  purely  exploratory 
mode,  is  that  if  the  assumptions  are  found  wanting  and  we  change  the  model,  what 
are the frequentist properties in terms of bias, the coverage of intervals, and the
$\alpha$ level of tests? Recall the discussion of Chap. 4. To avoid changing the model,
including transforming $x$ and/or $y$, one should try to think as much as possible
about a suitable model before the data are analyzed. As always the exact procedure
followed  should  be  reported,  so  that  inferential  summaries  can  be  more  easily 
interpreted.  The  same  problems  exist  for  a  Bayesian  analysis,  since  one  should 
specify  a  priori  all  models  that  one  envisages  fitting  (which  may  not  be  feasible 
in  advance),  with  subsequent  averaging  across  models  (Sect.  3.6). 


5.12  Example:  Prostate  Cancer 

We  return  to  the  PSA  data  and  provide  a  more  comprehensive  analysis.  We  fit  the 
full  (main  effects  only)  model 

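$$\log \mathrm{PSA} = \beta_0 + \beta_1 \times \log(\text{can vol}) + \beta_2 \times \log(\text{weight}) + \beta_3 \times \text{age} + \beta_4 \times \log(\text{BPH})$$
$$\qquad + \beta_5 \times \text{SVI} + \beta_6 \times \log(\text{cap pen}) + \beta_7 \times \text{gleason} + \beta_8 \times \text{PGS45} + \epsilon,$$

with $\epsilon \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$. The resultant least squares parameter estimates and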
standard  errors  are  given  in  Table  5.14.  This  table  includes  the  sandwich  standard 
errors,  to  address  the  possibility  of  nonconstant  variance  error  terms.  These  are 
virtually  identical  to  the  model-based  standard  errors.  This  is  not  surprising  given 
Fig.  5.9(a),  which  plots  the  absolute  value  of  the  residuals  against  the  fitted  values, 
and  indicates  that  the  constant  variance  assumption  appears  reasonable. 

With $n - k - 1 = 88$, we do not require normality of errors, but for illustration we
include a QQ normal plot in Fig. 5.9(b) and see that the errors are close to normal.
Figures 5.9(c) and (d) plot the residuals versus two of the more important covariates,
log cancer volume and log weight, with smoothers added. In each case, we see no
strong evidence of nonlinearity.

Fig. 5.9 Diagnostic plots in the prostate cancer study: (a) absolute values of residuals versus fitted
values, with smoother; (b) normal QQ plot of residuals; (c) residuals versus log cancer volume,
with smoother; (d) residuals versus log weight, with smoother

We  now  discuss  a  Bayesian  analysis  of  these  data.  With  the  improper  prior  (5.42), 
we  saw  in  Sect.  5.7  that  inference  was  identical  with  the  frequentist  approach  so 
that  the  estimates  and  (model-based)  standard  errors  in  Table  5.14  are  also  posterior 
means  and  posterior  standard  deviations.  Figure  5.10  displays  the  marginal  posterior 
densities  (which  are  located  and  scaled  Student’s  t  distributions  with  88  degrees  of 
freedom)  for  the  eight  coefficients.  In  this  plot,  for  comparability,  we  scale  each  of 
the  x  variables  to  lie  on  the  range  (0,1). 

Fig. 5.10 Marginal posterior distributions of regression coefficients associated with
the eight (standardized) covariates, for the prostate cancer data (horizontal axis: coefficient)

Turning now to an informative prior distribution, without more specific knowledge,
we let $\beta^* = [\beta_0^*, \ldots, \beta_8^*]^T$ represent the vector of coefficients associated with
the standardized covariates on (0,1). The prior is taken as $\pi(\beta^*)\pi(\sigma^2)$ with

$$\pi(\beta^*) = \prod_{j=0}^{8} \pi(\beta_j^*) \qquad (5.66)$$

and $\pi(\beta_0^*) \propto 1$ (an improper prior). For the regression coefficients, $\beta_j^* \sim N(0, V)$,
$j = 1, \ldots, 8$, with the standard deviations, $\sqrt{V}$, chosen in the following way. For the prostate
data, we believe that it is unlikely that any of the standardized covariates, over the
range (0,1), will change the median PSA by more than 10 units. The way we include
this information in the prior is by assuming that the $1.96 \times \sqrt{V}$ point of the prior
corresponds to the maximum value we believe is a priori plausible, that is, we set
$\beta_j^* = \log(10)$ equal to this point. For $\sigma^2$, we assume the improper choice
$\pi(\sigma^2) \propto \sigma^{-2}$.
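
The prior standard deviation implied by this elicitation follows from solving $1.96 \times \sqrt{V} = \log(10)$. A minimal sketch of the arithmetic in Python is given below; the function name `prior_sd` is hypothetical, and the normal quantile 1.96 and the value 10 are those stated in the text.

```python
import math


def prior_sd(max_effect, z=1.96):
    """Standard deviation sqrt(V) such that z * sqrt(V) equals log(max_effect)."""
    return math.log(max_effect) / z


sd = prior_sd(10.0)   # sqrt(V)
variance = sd ** 2    # V
print(round(sd, 4), round(variance, 4))
```

Under these assumptions $\sqrt{V}$ is roughly 1.17, so the normal prior places about 95% of its mass on coefficients between $-\log(10)$ and $\log(10)$.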

Figure  5.11  shows  the  95%  credible  intervals  under  the  flat  and  informative 
priors,  and  we  see  the  general  shrinkage  towards  zero  (the  prior  mean).  On  average 
there  is  around  a  10%  reduction  in  the  posterior  standard  deviations  (and  hence 
the  credible  intervals)  under  the  informative  prior,  which  shows  how  the  use  of 
informative  priors  can  aid  in  the  bias-variance  trade-off.  The  above  analysis  is 
closely related to ridge regression, as will be discussed in Sect. 10.5.1.


5.13  Concluding  Remarks 

In  this  chapter  we  have  concentrated  on  the  linear  model 

$$Y = x\beta + \epsilon$$

where $x$ is $n \times (k+1)$ and $\epsilon \sim N_n(0_n, \sigma^2 I_n)$. Although the range of models that are
routinely available for fitting has expanded greatly (see Chaps. 6 and 7), the linear
model continues to be popular. There are good reasons for this, since parameter
interpretation is straightforward and the estimators commonly used are linear in the
data and therefore possess desirable robustness properties.

Fig. 5.11 95% credible intervals for regression coefficients corresponding to
standardized covariates, under flat and informative priors, for the prostate cancer data

Unless $n$ is small, or there is substantial prior information, the point estimate

$$\hat{\beta} = (x^{T}x)^{-1}x^{T}y$$

and $100(1 - \alpha)\%$ interval estimate

$$\hat{\beta}_j \pm t^{n-k-1}_{1-\alpha/2} \times \mathrm{s.e.}(\hat{\beta}_j),$$

where $t^{n-k-1}_{1-\alpha/2}$ is the $100(1 - \alpha/2)\%$ point of a Student's t distribution with $n - k - 1$
degrees of freedom, emerge from likelihood, ordinary least squares, and Bayesian
analyses.  These  summaries  are  robust  to  a  range  of  distributions  for  the  error  terms, 
so  long  as  n  is  large.  Nonconstant  error  variance  and  correlated  errors  can  both 
seriously  damage  the  appropriateness  of  the  interval  estimate,  however.  With  larger 
sample  sizes,  sandwich  estimation  provides  a  good  approach  for  guarding  against 
nonconstant  error  variance. 


5.14  Bibliographic  Notes 

McCullagh and Nelder (1989, Chap. 3) provide an extended discussion on
parameterization issues, including aliasing, and the interpretation of parameters. For more
discussion of conditions for asymptotic normality for simple linear regression, see
van der Vaart (1998, p. 21). Firth (1987) discusses the loss of precision when the data
are not normally distributed and shows that the skewness of the true distribution
of the errors is an important factor. The theory presented in Lehmann (1986,
pp. 209-211) indicates that dependence in the residuals can cause real problems for
estimation of appropriate standard errors. Further details of residual analysis may
be found in Cook and Weisberg (1982).

The classic frequentist text on the analysis of variance is Scheffé (1959), while
Searle  et  al.  (1992)  provide  a  more  recent  treatment.  An  interesting  discussion,  from 
a  Bayesian  slant,  is  provided  by  Gelman  and  Hill  (2007,  Chap.  22). 

Numerous  texts  have  been  written  on  the  linear  model;  see,  for  example, 
Ravishanker  and  Dey  (2002)  and  Seber  and  Lee  (2003)  for  the  theory  and  Faraway 
(2004)  for  a  more  practical  slant. 


5.15  Exercises 

5.1  Consider  the  model 

$$Y = x\beta + \epsilon,$$

where $Y$ is the $n \times 1$ vector of responses, $x$ is the $n \times (k+1)$ design matrix,
$\beta = [\beta_0, \ldots, \beta_k]^T$, and $E[\epsilon] = 0$, $\mathrm{var}(\epsilon) = \sigma^2 V$ where $V$ is a known correlation
matrix.

(a) By considering the sum of squares,

$$\mathrm{RSS}_V = (Y - x\beta)^{T}V^{-1}(Y - x\beta),$$

show that the generalized least squares estimator is

$$\hat{\beta}_V = (x^{T}V^{-1}x)^{-1}x^{T}V^{-1}Y,$$

provided the necessary inverse exists.

(b) Derive the distribution of $\hat{\beta}_V$.

(c) Show that $\hat{\sigma}^2$, as defined in (5.33), is an unbiased estimator of $\sigma^2$.

5.2 Suppose $\hat{\beta}_1 \neq \hat{\beta}_2$ are two different least squares estimates of $\beta$. Show there
are infinitely many least squares estimates of $\beta$.

5.3 Let $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \ldots, n$, where $E[\epsilon_i] = 0$, $\mathrm{var}(\epsilon_i) = \sigma^2$ and
$\mathrm{cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$. Prove that the least squares estimates of $\beta_0$ and $\beta_1$
are uncorrelated if and only if $\bar{x} = 0$.

5.4  Consider  the  simple  linear  regression  model 


$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,$$

with $\epsilon_i \mid \sigma^2 \sim_{iid} N(0, \sigma^2)$, $i = 1, \ldots, n$. Suppose the prior distribution is of
the form

$$\pi(\beta_0, \beta_1, \sigma^2) = \pi(\beta_0, \beta_1) \times \sigma^{-2} \qquad (5.67)$$

where the prior for $[\beta_0, \beta_1]$ is

$$\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \sim N_2\left( \begin{bmatrix} m_0 \\ m_1 \end{bmatrix}, \begin{bmatrix} v_{00} & v_{01} \\ v_{01} & v_{11} \end{bmatrix} \right).$$

In this exercise the conditional distributions required for Gibbs sampling
(Sect. 3.8.4) will be derived.

(a) Write down the form of the posterior distribution (up to proportionality)
and derive the conditional distributions $p(\beta_0 \mid \beta_1, \sigma^2, y)$, $p(\beta_1 \mid \beta_0, \sigma^2, y)$,
and $p(\sigma^2 \mid \beta_0, \beta_1, y)$. Hence, give details of the Gibbs sampling algorithm.

(b) Another, blocked, Gibbs sampling algorithm (Sect. 3.8.6) would simulate
from the distributions $p(\beta \mid \sigma^2, y)$ and $p(\sigma^2 \mid \beta, y)$. Derive these
distributions, given in (5.46) and (5.47), and hence describe the form of the Gibbs
sampling algorithm.

5.5  The  algorithm  derived  in  Exercise  5.4(b)  will  now  be  implemented  for  the 
prostate  cancer  data  of  Sect.  1.3.1.  These  data  are  available  in  the  R  package 
lasso2  and  are  named  Prostate.  Take  Y  as  log  prostate  specific  antigen 
and  x  as  log  cancer  volume.  Implement  the  blocked  Gibbs  sampling  algorithm 
using the prior (5.67), with $m_0 = m_1 = 0$, $v_{00} = v_{11} = 2$, and $v_{01} = 0$. Run
two chains, one with starting values corresponding to the unbiased estimates of
the parameters and one starting from a point randomly generated from the prior
$\pi(\beta_0, \beta_1)$. Report:

(a) Histogram representations of the univariate marginal distributions $p(\beta_0 \mid y)$,
$p(\beta_1 \mid y)$, and $p(\sigma \mid y)$ and scatterplots of the bivariate marginal
distributions $p(\beta_0, \beta_1 \mid y)$, $p(\beta_0, \sigma \mid y)$, and $p(\beta_1, \sigma \mid y)$.

(b) The posterior means, standard deviations, and 10%, 50%, 90% quantiles
for $\beta_0$, $\beta_1$, and $\sigma$.

(c) $\Pr(\beta_1 > 0.5 \mid y)$.

(d) Justify your choice of "burn-in" period (Sect. 3.8.6). For example, you may
present the trace plots $\beta_0^{(t)}$, $\beta_1^{(t)}$, $\log \sigma^{2(t)}$ versus $t$.

(e) Confirm the results you have obtained using INLA or WinBUGS.

5.6 In this question, parameter interpretation will be considered. Consider a
continuous univariate response $y$, with two potential covariates, a continuous
variable $x_1$, and a binary factor $x_2$. The $x$ variables will be referred to as age
and gender, respectively. Consider the six models:

Model A

$$y = \begin{cases} \theta_0 + \epsilon, & \text{for men } (x_2 = 0) \\ \theta_1 + \epsilon, & \text{for women } (x_2 = 1). \end{cases}$$

Model B

$$y = \theta_0 + \theta_1 x_1 + \epsilon.$$

Model C

$$y = \begin{cases} \theta_0 + \theta_1 x_1 + \epsilon, & \text{for men } (x_2 = 0) \\ \theta_2 + \theta_1 x_1 + \epsilon, & \text{for women } (x_2 = 1). \end{cases}$$

Model D

$$y = \begin{cases} \theta_0 + \theta_1 x_1 + \epsilon, & \text{for men } (x_2 = 0) \\ (\theta_0 + \phi_0) + \theta_1 x_1 + \epsilon, & \text{for women } (x_2 = 1). \end{cases}$$

Model E

$$y = \begin{cases} \theta_0 + \theta_1 x_1 + \epsilon, & \text{for men } (x_2 = 0), \text{ and} \\ \theta_0 + \theta_2 x_1 + \epsilon, & \text{for women } (x_2 = 1). \end{cases}$$

Model F

$$y = \begin{cases} \theta_0 + \theta_1 x_1 + \epsilon, & \text{for men } (x_2 = 0), \\ \theta_2 + \theta_3 x_1 + \epsilon, & \text{for women } (x_2 = 1). \end{cases}$$

For each model, the error terms $\epsilon$ are assumed to have zero mean.

(a) For each model, provide a careful interpretation of the parameters and give
a description of the assumed form of the relationship.

(b) Which of the above models are equivalent?

5.7 Let $Y_1, \ldots, Y_n$ be distributed as $Y_i \mid \theta, \sigma^2 \sim_{ind} N(i\theta, i^2\sigma^2)$ for $i = 1, \ldots, n$.
Find the generalized least squares estimate of $\theta$ and prove that the variance of
this estimate is $\sigma^2/n$.

5.8 Suppose that the design matrix $x$ of dimension $n \times (k+1)$ has rank $k + 1$ and
let $h = x(x^{T}x)^{-1}x^{T}$ represent the hat matrix. Show that $\mathrm{tr}(h) = k + 1$.

5.9  Consider  the  model 

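$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

for $i = 1, \ldots, n$, where

$$\begin{bmatrix} X_i \\ \epsilon_i \end{bmatrix} \sim N_2\left( \begin{bmatrix} \mu_x \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_\epsilon^2 \end{bmatrix} \right)$$

to give

$$\begin{bmatrix} Y_i \\ X_i \end{bmatrix} \sim N_2\left( \begin{bmatrix} \mu_y \\ \mu_x \end{bmatrix}, \begin{bmatrix} \sigma_y^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_x^2 \end{bmatrix} \right)$$

where $\sigma_y^2 = \beta_1^2\sigma_x^2 + \sigma_\epsilon^2$, $\mu_y = \beta_0 + \beta_1\mu_x$ and $\sigma_{xy} = \beta_1\sigma_x^2$.

(a) Derive $E[Y_i \mid x_i]$ and $\mathrm{var}(Y_i \mid x_i)$.

Now suppose one does not observe $x_i$, $i = 1, \ldots, n$, but instead $W_i = X_i + U_i$ where

$$\begin{bmatrix} X_i \\ \epsilon_i \\ U_i \end{bmatrix} \sim N_3\left( \begin{bmatrix} \mu_x \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_x^2 & 0 & 0 \\ 0 & \sigma_\epsilon^2 & 0 \\ 0 & 0 & \sigma_u^2 \end{bmatrix} \right).$$

Assume that $Y_i$ is conditionally independent of $W_i$, that is, $E[Y_i \mid X_i, U_i] =
E[Y_i \mid X_i]$. Suppose the true model is $E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i$ but the
observed data are $[w_i, y_i]$, $i = 1, \ldots, n$.

(b) Relate $E[Y_i \mid W_i]$ to $E[X_i \mid W_i]$.

(c) What is the joint distribution of $X_i$ and $W_i$ and what is $E[X_i \mid W_i]$?

(d) Using your answers to (b) and (c), show that $E[Y_i \mid W_i] = \beta_0^* + \beta_1^* W_i$.

(e) What is the relationship between $\beta_0^*$, $\beta_1^*$ and $\beta_0$, $\beta_1$?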


Chapter  6 

General  Regression  Models 


6.1  Introduction 

In  this  chapter  we  consider  the  analysis  of  data  that  are  not  well-modeled  by  the 
linear  models  described  in  Chap.  5.  We  continue  to  assume  that  the  responses  are 
(conditionally)  independent.  We  describe  two  model  classes,  generalized  linear 
models  (GLMs)  and  what  we  refer  to  as  nonlinear  models.  In  the  latter,  a  response 
$Y$ is assumed to be of the form $Y = \mu(x, \beta) + \epsilon$ with $\mu(x, \beta)$ nonlinear in $x$ and
the errors $\epsilon$ independent with zero mean.

In  Sect.  6.2  we  introduce  a  motivating  pharmacokinetic  dataset  that  we  will 
subsequently  analyze  using  both  GLMs  and  nonlinear  models.  Section  6.3  considers 
GLMs,  which  were  introduced  as  an  extension  to  linear  models  and  have  received 
considerable  attention  due  to  their  computational  and  mathematical  convenience. 
While  computational  advances  have  unshackled  the  statistician  from  the  need  to 
restrict  attention  to  GLMs,  they  still  provide  an  extremely  useful  class.  Parameter 
interpretation  for  GLMs  is  discussed  in  Sect.  6.4.  Sections  6.5,  6.6,  6.7,  and  6.8 
describe,  respectively,  likelihood  inference,  quasi-likelihood  inference,  sandwich 
estimation, and Bayesian inference for the GLM. Section 6.9 considers the
assessment of the assumptions required for reliable inference in GLMs. In Sect. 6.10, we
introduce nonlinear regression models, with identifiability discussed in Sect. 6.11.
We then describe likelihood and least squares approaches to inference in Sects. 6.12
and 6.13 and sandwich estimation in Sect. 6.14. A geometrical comparison of linear
and  nonlinear  least  squares  is  provided  in  Sect.  6.15.  Bayesian  inference  is  described 
in  Sect.  6.16  and  Sect.  6.17  concentrates  on  the  examination  of  assumptions. 
Concluding  comments  appear  in  Sect.  6.18  with  bibliographic  notes  in  Sect.  6.19. 

In  Chap.  7  we  discuss  models  for  binary  data;  models  for  such  data  could  have 
been  included  in  this  chapter  but  are  considered  separately  since  there  are  a  number 
of  wrinkles  that  deserve  specific  attention. 


J. Wakefield, Bayesian and Frequentist Regression Methods, Springer Series
in Statistics, DOI 10.1007/978-1-4419-0925-1_6,

© Springer Science+Business Media New York 2013


6.2 Motivating Example: Pharmacokinetics of Theophylline


In Table 1.2 we displayed pharmacokinetic data on the sampling times and measured
concentrations of the drug theophylline, collected from a subject who received an
oral dose of 4.53 mg/kg. These data are plotted in Fig. 6.1, along with fitted curves
from various approaches to modeling that we describe subsequently. We will fit
both a nonlinear (so-called compartmental) model to these data and a GLM. Let $x_i$
and $y_i$ represent the sampling time and concentration in sample $i$, respectively, for
$i = 1, \ldots, n = 10$.

In  Sect.  1.3.4,  we  detailed  the  aims  of  a  pharmacokinetic  study  and  described  in 
some  detail  compartmental  models  that  have  been  successfully  used  for  modeling 
concentration-time data. Let $\mu(x)$ represent the deterministic model relating the
response to time, $x$; $\mu(x)$ will usually be the mean response, though it may correspond
to the median response, depending on the assumed error structure. Notationally
we have suppressed the dependence of $\mu(x)$ on unknown parameters. For the data
considered here, a starting point for $\mu(x)$ is

$$\mu(x) = \frac{Dk_a}{V(k_a - k_e)}\left[\exp(-k_e x) - \exp(-k_a x)\right] \qquad (6.1)$$

where $k_a > 0$ is the absorption rate constant, $k_e > 0$ is the elimination rate constant,
and $V > 0$ is the (apparent) volume of distribution (that converts total amount
of drug into concentration). This model was motivated in Sect. 1.3.4. A stochastic
component may be added to (6.1) in a variety of ways, but one simple approach
is via

$$y(x) = \mu(x) + \delta(x), \qquad (6.2)$$

where $E[\delta(x)] = 0$ and $\mathrm{var}[\delta(x)] = \sigma^2\mu(x)^2$ with $\delta(x)$ at different times $x$
being independent. The variance model produces a constant coefficient of variation
(defined as the ratio of the standard deviation to the mean), which is often observed
in practice for pharmacokinetic data. Combining (6.1) and (6.2) gives an example
of a three parameter nonlinear model. An approximately constant coefficient of
variation can also be achieved by taking

$$\log y(x) = \log \mu(x) + \epsilon(x),$$

with $E[\epsilon(x)] = 0$ and $\mathrm{var}[\epsilon(x)] = \sigma^2$. In this case, $\mu(x)$ represents the median
concentration at time $x$ (Sect. 5.5.3).

Fig. 6.1 Theophylline data, along with fitted curves under various models and inferential
approaches. Four curves are included, corresponding to MLE and Bayes analyses of
GLM and nonlinear models. The two nonlinear curves are indistinguishable

Model (6.1) is sometimes known as the flip-flop model, because there is an
identifiability problem in that the same curve is achieved with each of the parameter
sets $[V, k_a, k_e]$ and $[Vk_e/k_a, k_e, k_a]$. Recall from Sect. 2.4.1 that identifiability is
required for consistency and asymptotic normality of the MLE. Often, identifiability
is achieved by enforcing $k_a > k_e > 0$, since the absorption rate is greater
than the elimination rate for most drugs. Such identifiability issues are not a rare
phenomenon for nonlinear models, and will receive further attention in Sect. 6.11.
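
The flip-flop property can be confirmed numerically by evaluating the compartmental mean function at both parameter sets. The sketch below is in Python for illustration; the function name `mu` mirrors the notation of (6.1), the dose is the 4.53 mg/kg quoted above, and the values of $V$, $k_a$, $k_e$ and the time grid are hypothetical.

```python
import math


def mu(x, V, ka, ke, D=4.53):
    """Mean concentration at time x under the one-compartment model (6.1)."""
    return D * ka / (V * (ka - ke)) * (math.exp(-ke * x) - math.exp(-ka * x))


# Hypothetical parameter values with ka > ke > 0.
V, ka, ke = 0.5, 1.5, 0.1
times = [0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 24.0]

original = [mu(x, V, ka, ke) for x in times]
# Flip-flop parameter set: [V * ke / ka, ke, ka].
flipped = [mu(x, V * ke / ka, ke, ka) for x in times]

for x, a, b in zip(times, original, flipped):
    print(x, round(a, 6), round(b, 6))
```

The two columns of concentrations agree at every time point, which is why a constraint such as $k_a > k_e$ is needed before the parameters can be estimated uniquely.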

Model (6.1) may be written in the alternative form

$$\mu(x) = \frac{Dk_a}{V(k_a - k_e)}\left[\exp(-k_e x) - \exp(-k_a x)\right]$$
$$\qquad = \exp(\beta_0 + \beta_1 x)\left\{1 - \exp[-(k_a - k_e)x]\right\}, \qquad (6.3)$$

where $\beta_0 = \log[Dk_a/V(k_a - k_e)]$ and $\beta_1 = -k_e$. As an alternative to the
compartmental model, (6.1), we will also consider the fractional polynomial model
(as introduced by Nelder 1966) given by

$$\mu(x) = \exp(\beta_0 + \beta_1 x + \beta_2/x). \qquad (6.4)$$

Comparison with (6.3) shows that $\beta_2$ is the parameter that is determining the
absorption phase. This model only makes sense if it produces both an increasing absorption
phase and a decreasing elimination phase, which correspond, respectively, to
$\beta_2 < 0$ and $\beta_1 < 0$. When combined with an appropriate choice for the stochastic
component, model (6.4) falls within the GLM class, as we see shortly.

In a pharmacokinetic study, as discussed in Sect. 1.3.4, interest often focuses
on certain derived parameters. Of specific interest are $x_{1/2}$, the elimination
half-life, which is the time it takes for the drug concentration to drop by 50% (for
times sufficiently large for elimination to be the dominant process); $x_{\max}$, the time
to maximum concentration; $\mu(x_{\max})$, the maximum concentration; and Cl, the
clearance, which is the amount of blood cleared of drug in unit time.

With respect to model (6.1), the derived parameters of interest, in terms of
$[V, k_a, k_e]$, are

$$x_{1/2} = \frac{\log 2}{k_e}$$

$$x_{\max} = \frac{1}{k_a - k_e}\log\left(\frac{k_a}{k_e}\right)$$

$$\mu(x_{\max}) = \frac{Dk_a}{V(k_a - k_e)}\left[\exp(-k_e x_{\max}) - \exp(-k_a x_{\max})\right] = \frac{D}{V}\left(\frac{k_a}{k_e}\right)^{-k_e/(k_a - k_e)}$$

$$\mathrm{Cl} = \frac{D}{\mathrm{AUC}} = V \times k_e$$

where AUC is the area under the concentration-time curve between 0 and $\infty$. With
respect to model (6.4), as functions of $\beta = [\beta_0, \beta_1, \beta_2]$,

$$x_{1/2} = -\frac{\log 2}{\beta_1}$$

$$x_{\max} = \left(\frac{\beta_2}{\beta_1}\right)^{1/2}$$

$$\mu(x_{\max}) = \exp\left[\beta_0 - 2(\beta_1\beta_2)^{1/2}\right]$$

$$\mathrm{Cl} = \frac{D\sqrt{\beta_1/\beta_2}}{2\exp(\beta_0)K_1\left[2(\beta_1\beta_2)^{1/2}\right]}, \qquad (6.5)$$

where $K_s(x)$ denotes a modified Bessel function of the second kind of order $s$.
Consequently, for both models, the quantities of interest are nonlinear functions of
the original parameters, which has implications for inference.
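
For the compartmental model (6.1), the derived parameters can be computed and sanity-checked against the curve itself. The following Python sketch is illustrative only: the function names and the parameter values are hypothetical, the dose is the 4.53 mg/kg given in Sect. 6.2, and the closed forms used are those that follow from maximizing and integrating (6.1).

```python
import math


def mu(x, V, ka, ke, D):
    """Mean concentration at time x under the one-compartment model (6.1)."""
    return D * ka / (V * (ka - ke)) * (math.exp(-ke * x) - math.exp(-ka * x))


def derived_parameters(V, ka, ke, D):
    """Return (half-life, time of maximum, maximum concentration, clearance)."""
    half_life = math.log(2) / ke
    x_max = math.log(ka / ke) / (ka - ke)             # solves d mu / dx = 0
    c_max = (D / V) * (ka / ke) ** (-ke / (ka - ke))  # equals mu(x_max)
    clearance = V * ke                                # D / AUC, with AUC = D / (V * ke)
    return half_life, x_max, c_max, clearance


D = 4.53                   # dose from Sect. 6.2 (mg/kg)
V, ka, ke = 0.5, 1.5, 0.1  # hypothetical parameter values
half_life, x_max, c_max, clearance = derived_parameters(V, ka, ke, D)

print("half-life:", round(half_life, 3))
print("x_max:", round(x_max, 3), "mu(x_max):", round(c_max, 3))
print("clearance:", clearance)
```

Evaluating `mu` slightly to either side of `x_max` gives smaller concentrations, and at late times the concentration halves over one half-life, which are quick checks that the closed forms match the curve.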


6.3  Generalized  Linear  Models 


Generalized  linear  models  (GLMs)  were  introduced  by  Nelder  and  Wedderburn 
(1972)  and  provide  a  class  with  relatively  broad  applicability  and  desirable  statistical 
properties.  For  a  GLM: 


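• The responses $y_i$ follow an exponential family, so that the distribution is of
the form

$$p(y_i \mid \theta_i, \alpha) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{\alpha} + c(y_i, \alpha)\right\} \qquad (6.6)$$

for functions $b(\cdot)$, $c(\cdot, \cdot)$ and where $\theta_i$ and $\alpha$ are scalars. It is straightforward to
show (using the results of Sect. 2.4) that

$$E[Y_i \mid \theta_i, \alpha] = \mu_i = b'(\theta_i)$$

and

$$\mathrm{var}(Y_i \mid \theta_i, \alpha) = \alpha b''(\theta_i) = \alpha V(\mu_i),$$

for $i = 1, \ldots, n$. We assume $\mathrm{cov}(Y_i, Y_j \mid \theta_i, \theta_j, \alpha) = 0$, for $i \neq j$ (Chap. 9
provides the extension to dependent data).

• A link function $g(\cdot)$ provides the connection between the mean function $\mu_i =
E[Y_i \mid \theta_i, \alpha]$ and the linear predictor $x_i\beta$ via

$$g(\mu_i) = x_i\beta,$$

where $x_i$ is a $(k+1) \times 1$ vector of explanatory variables (including a 1 for
the intercept) and $\beta = [\beta_0, \beta_1, \ldots, \beta_k]^T$ is a $(k+1) \times 1$ vector of regression
parameters.

Table 6.1 Characteristics of some common GLMs. The notation is as in (6.6). The canonical
parameter is $\theta$, the mean is $E[Y] = \mu$, and the variance is $\mathrm{var}(Y) = \alpha V(\mu)$

| Distribution | Mean $E[Y \mid \theta]$ | Variance $V(\mu)$ | $b(\theta)$ | $c(y, \alpha)$ |
|---|---|---|---|---|
| $N(\mu, \sigma^2)$ | $\theta$ | $1$ | $\theta^2/2$ | $-\frac{1}{2}\left[\frac{y^2}{\alpha} + \log(2\pi\alpha)\right]$ |
| Poisson($\mu$) | $\exp(\theta)$ | $\mu$ | $\exp(\theta)$ | $-\log y!$ |
| Bernoulli($\mu$) | $\frac{\exp(\theta)}{1 + \exp(\theta)}$ | $\mu(1 - \mu)$ | $\log(1 + e^{\theta})$ | $0$ |
| Ga($1/\alpha$, $1/[\mu\alpha]$) | $-\frac{1}{\theta}$ | $\mu^2$ | $-\log(-\theta)$ | $\frac{1}{\alpha}\log\left(\frac{y}{\alpha}\right) - \log y - \log\Gamma\left(\frac{1}{\alpha}\right)$ |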
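
As a worked check of the moment results, the Poisson case can be written out in the notation of (6.6). The derivation below is a sketch added for illustration; it uses only the Poisson log probability function and the identities $E[Y \mid \theta] = b'(\theta)$ and $\mathrm{var}(Y \mid \theta) = \alpha b''(\theta)$ stated above.

```latex
\begin{align*}
\log p(y \mid \mu) &= y \log \mu - \mu - \log y! \\
  &= \frac{y\theta - b(\theta)}{\alpha} + c(y, \alpha),
  \qquad \theta = \log \mu,\quad b(\theta) = e^{\theta},\quad \alpha = 1,\quad c(y, \alpha) = -\log y!, \\
E[Y \mid \theta] &= b'(\theta) = e^{\theta} = \mu, \\
\operatorname{var}(Y \mid \theta) &= \alpha\, b''(\theta) = e^{\theta} = \mu,
  \qquad \text{so that } V(\mu) = \mu .
\end{align*}
```

Because $\theta = \log\mu$, the log link is the canonical link for the Poisson distribution, and the fixed value $\alpha = 1$ ties the variance to the mean.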

To  summarize,  a  GLM  assumes  a  linear  relationship  on  a  transformed  mean  scale 
(which,  as  we  shall  see,  offers  certain  computational  and  statistical  advantages)  and 
an  exponential  family  form  for  the  distribution  of  the  response. 

If $\alpha$ is known, then (6.6) is a one-parameter exponential family model. If $\alpha$ is
unknown, then the distribution may or may not be a two-parameter exponential
family model. So-called canonical links have $\theta_i = x_i\beta$ and provide simplifications
in terms of computation.

GLMs  are  very  useful  pedagogically  since  they  separate  the  deterministic  and 
stochastic  components  of  the  model,  and  this  aspect  was  emphasized  in  the  abstract 
of  Nelder  and  Wedderburn  (1972):  “The  implications  of  the  approach  in  designing 
statistics  courses  are  discussed.” 

Table  6.1,  adapted  from  Table  2.1  of  McCullagh  and  Nelder  (1989),  characterizes 
a  number  of  common  GLMs.  Another  example  which  is  not  listed  in  the  table,  is  the 
inverse  Gaussian  distribution;  Exercise  6.1  derives  the  detail  for  this  case. 
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Example:  Pharmacokinetics  of  Theophylline 


Model (6.3) is an example of a GLM with a log link:

$$\log \mu(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \tag{6.7}$$

where $x = [1, x_1, x_2]$ and $x_2 = 1/x_1$.

Turning to the stochastic component, as noted in Sect. 6.2, the error terms often display a constant coefficient of variation. With this in mind, we may combine (6.7) with a gamma distribution via

$$Y(x) \mid \beta, \alpha \sim_{ind} \mathrm{Ga}\left\{\alpha^{-1}, [\mu(x)\alpha]^{-1}\right\}, \tag{6.8}$$

to give $\mathrm{E}[Y(x)] = \mu(x)$ and $\mathrm{var}[Y(x)] = \alpha\mu(x)^2$ so that $\alpha^{1/2}$ is the coefficient of variation. Lindsey et al. (2000) examined various distributional choices for pharmacokinetic data and found the gamma assumption to be reasonable in their examples. It is interesting to note that for the gamma distribution, the reciprocal transform is the canonical link, but this option is not statistically appealing since it does not constrain the mean function to be positive. In the pharmacokinetic context the reciprocal link also results in a concentration-time curve that is not integrable between 0 and $\infty$, so that the fundamental clearance parameter is undefined. One disadvantage of the loglinear GLM defined above, compared to the nonlinear compartmental model we discuss later, is that if multiple doses are considered, the mean function does not correspond to a GLM.
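The moment claims attached to (6.8) can be checked numerically. The following is a minimal sketch, assuming the shape/rate convention for the gamma distribution (mean = shape/rate, variance = shape/rate^2); the function name and the input values are hypothetical and chosen only for illustration.

```python
import math

def gamma_moments(mu, alpha):
    """Mean, variance and coefficient of variation of Ga(shape, rate)
    with shape = 1/alpha and rate = 1/(mu * alpha), as in (6.8)."""
    shape = 1.0 / alpha
    rate = 1.0 / (mu * alpha)
    mean = shape / rate          # equals mu
    var = shape / rate ** 2      # equals alpha * mu^2
    cv = math.sqrt(var) / mean   # equals sqrt(alpha), free of mu
    return mean, var, cv

# Hypothetical values: mean concentration 8.0 and alpha = 0.0025 (a 5% CV).
mean, var, cv = gamma_moments(mu=8.0, alpha=0.0025)
print(mean, var, cv)
```

Because the coefficient of variation does not depend on the mean, the same value of alpha describes the relative spread at every point on the concentration-time curve.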


Example:  Lung  Cancer  and  Radon 

In Sect. 1.3.3 we described data on lung cancer incidence in counties in Minnesota, with $Y_i$ the number of cases, $x_i$ the average radon, and $E_i$ the expected number of cases, in area $i$, $i = 1, \ldots, n$. These data were examined repeatedly in Chaps. 2 and 3.

A starting model is $Y_i \mid \beta \sim_{ind} \mathrm{Poisson}[E_i \exp(\beta_0 + \beta_1 x_i)]$, which we write as

$$\log \Pr(Y = y_i \mid \beta) = y_i \log \mu_i - \mu_i - \log y_i!$$

with $\log \mu_i = \log E_i + \beta_0 + \beta_1 x_i$, to give a GLM with a (canonical) log link. As discussed in Chaps. 2 and 3, this model is fundamentally inadequate because $\alpha = 1$, and so there is no parameter to allow for excess-Poisson variation. The latter can be modeled using the negative binomial model of Sect. 6.3 or the quasi-likelihood approach described in Sect. 6.6.

With unknown scale parameter, the negative binomial is not a GLM. We consider the case of known $b$ (which will rarely be of interest in a practical setting). For consistency with its use in Chap. 2, we label the scale parameter of the negative binomial model as $b$. In the following, care should therefore be taken to discriminate between $b(\cdot)$, as in (6.6), and the scale parameter, $b$. From (2.40),

$$\log \Pr(Y = y_i \mid \mu_i) = b^{-1}\left[ y_i\, b \log\left(\frac{\mu_i}{\mu_i + b}\right) - b^2 \log(\mu_i + b) \right] + \log\Gamma(y_i + b) - \log\Gamma(b) - \log y_i! + b\log b,$$

which is of the form (6.6), with scale $\alpha = b$ and

$$\theta_i = b \log\left(\frac{\mu_i}{\mu_i + b}\right),$$

$$b(\theta_i) = b^2 \log(\mu_i + b),$$

$$c(y_i, b) = \log\Gamma(y_i + b) - \log\Gamma(b) - \log y_i! + b\log b,$$

so that

$$\mathrm{E}[Y_i \mid \mu_i] = \mu_i = b'(\theta_i) = \frac{b\,e^{\theta_i/b}}{1 - e^{\theta_i/b}},$$

$$\mathrm{var}(Y_i \mid \mu_i) = b \times b''(\theta_i) = \mu_i + \mu_i^2/b.$$

The canonical link is

$$\theta_i = b \log\left(\frac{\mu_i}{\mu_i + b}\right) = x_i\beta,$$

which depends on $b$. The negative binomial distribution is described in detail by Cameron and Trivedi (1998).


6.4  Parameter  Interpretation 

Interpretation  of  the  regression  parameters  in  a  GLM  is  link  function  specific.  The 
linear  link  was  discussed  in  Chap.  5,  and  the  log  link  was  considered  repeatedly 
(in  the  context  of  the  lung  cancer  and  radon  data)  in  Chaps.  2  and  3.  We  provide  an 
interpretation  of  binary  data  link  functions,  such  as  the  logistic,  in  Chap.  7.  Linearity 
on  some  scale  offers  advantages,  as  illustrated  by  the  following  example. 

Consider the log-linear model

$$\log \mu(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$

The parameter $\exp(\beta_1)$ has a relatively straightforward interpretation, being the multiplicative change in the average response associated with a one-unit increase in $x_1$, with $x_2$ held constant.
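The multiplicative interpretation of $\exp(\beta_1)$ can be verified directly. The sketch below uses hypothetical coefficient values (they are not estimates from any dataset in the text) and compares the ratio of mean responses at two covariate values one unit apart with $\exp(\beta_1)$.

```python
import math

def mu(x1, x2, beta):
    """Mean response under the log-linear model log mu = b0 + b1*x1 + b2*x2."""
    b0, b1, b2 = beta
    return math.exp(b0 + b1 * x1 + b2 * x2)

beta = (0.5, 0.2, -0.3)   # hypothetical coefficients, for illustration only

# Increase x1 by one unit while holding x2 fixed at 1.5.
ratio = mu(3.0, 1.5, beta) / mu(2.0, 1.5, beta)
print(ratio, math.exp(beta[1]))
```

The ratio does not depend on the starting value of $x_1$ or on the value at which $x_2$ is held, which is what makes the parameter easy to report.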

In contrast, for general nonlinear models, the parameters often define particular functions of the response-covariate curve or fundamental quantities that define the system under study. We saw an example of this in Sect. 6.2, in which the nonlinear concentration-time curve (6.1) was defined in terms of the volume of distribution $V$ and the absorption and elimination rate constants $k_a$ and $k_e$. Alternatively, we could define the model in terms of characteristics of the curve, for example, the half-life, $x_{1/2}$, the time to maximum concentration, $x_{\max}$, and the maximum concentration, $\mu(x_{\max})$. We now discuss inference for the GLM.


6.5  Likelihood  Inference  for  GLMs 
6.5.1  Estimation 

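We first derive the score vector and information matrix. For an independent sample from the exponential family (6.6), the log-likelihood is

$$l(\theta) = \sum_{i=1}^n l_i = \sum_{i=1}^n \left[ \frac{y_i\theta_i - b(\theta_i)}{\alpha} + c(y_i, \alpha) \right],$$

where $\theta = \theta(\beta) = [\theta_1(\beta), \ldots, \theta_n(\beta)]$ is the vector of canonical parameters. Using the chain rule, the score function is

$$S(\beta) = \frac{\partial l}{\partial \beta} = \sum_{i=1}^n \frac{dl_i}{d\theta_i}\,\frac{d\theta_i}{d\mu_i}\left(\frac{\partial \mu_i}{\partial \beta}\right)^{\mathsf T} = \sum_{i=1}^n \frac{Y_i - b'(\theta_i)}{\alpha}\,\frac{1}{V_i}\left(\frac{\partial \mu_i}{\partial \beta}\right)^{\mathsf T}, \tag{6.9}$$

where $\mathrm{var}(Y_i \mid \beta) = \alpha V_i$ and

$$\frac{d^2 b}{d\theta_i^2} = \frac{d\mu_i}{d\theta_i} = V_i,$$

for $i = 1, \ldots, n$. Hence,

$$S(\beta) = \sum_{i=1}^n \left(\frac{\partial \mu_i}{\partial \beta}\right)^{\mathsf T} \frac{Y_i - \mu_i}{\alpha V_i} = D^{\mathsf T} V^{-1} [Y - \mu(\beta)]/\alpha, \tag{6.10}$$

where $D$ is the $n \times (k+1)$ matrix with elements $\partial\mu_i/\partial\beta_j$, $i = 1, \ldots, n$, $j = 0, \ldots, k$, and $V$ is the $n \times n$ diagonal matrix with $i$th diagonal element $V_i$. Consequently, an estimator $\hat\beta_n$ defined through $S(\hat\beta_n) = 0$ will be consistent so long as the mean function is correctly specified, since the estimating function is unbiased in this case. For canonical links, for which $\theta_i = x_i\beta$,

$$\frac{\partial l}{\partial \beta} = \sum_{i=1}^n \frac{dl_i}{d\theta_i}\frac{\partial \theta_i}{\partial \beta} = \sum_{i=1}^n x_i^{\mathsf T}\,\frac{Y_i - \mu_i}{\alpha},$$

so that the sufficient statistics

$$\sum_{i=1}^n x_i^{\mathsf T} Y_i = \sum_{i=1}^n x_i^{\mathsf T} \hat\mu_i$$

are recovered at the MLE, $\hat\beta$.

From Result 2.1, the MLE has asymptotic distribution

$$I_n(\beta)^{1/2}(\hat\beta_n - \beta) \rightarrow_d \mathrm{N}_{k+1}(0, I_{k+1}),$$

where the expected information is

$$I_n(\beta) = \mathrm{E}[S(\beta)S(\beta)^{\mathsf T}] = D^{\mathsf T} V^{-1} D/\alpha.$$

In practice we use

$$I_n(\hat\beta_n) = \hat D^{\mathsf T} \hat V^{-1} \hat D/\alpha,$$

where $\hat V$ and $\hat D$ are evaluated at $\hat\beta_n$. The variance of the estimator is

$$\widehat{\mathrm{var}}(\hat\beta) = \alpha\left(\hat D^{\mathsf T} \hat V^{-1} \hat D\right)^{-1} \tag{6.11}$$

and is consistently estimated if the second moment is correctly specified.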

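The information matrix may be written in a particularly simple and useful form, as we now show. We first let $\eta_i = g(\mu_i)$ denote the linear predictor. The score, (6.9), may be written, for parameter $j$, $j = 0, 1, \ldots, k$, as

$$S_j(\beta) = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^n \frac{Y_i - \mu_i}{\alpha V_i}\,\frac{d\mu_i}{d\eta_i}\,\frac{\partial \eta_i}{\partial \beta_j} = \sum_{i=1}^n \frac{Y_i - \mu_i}{\alpha V_i}\,\frac{d\mu_i}{d\eta_i}\,x_{ij}. \tag{6.12}$$

Hence, element $(j, j')$ of the expected information is

$$-\mathrm{E}\left[\frac{\partial^2 l}{\partial \beta_j \partial \beta_{j'}}\right] = \sum_{i=1}^n \mathrm{E}\left[ \frac{(Y_i - \mu_i)x_{ij}}{\alpha V_i}\frac{d\mu_i}{d\eta_i} \times \frac{(Y_i - \mu_i)x_{ij'}}{\alpha V_i}\frac{d\mu_i}{d\eta_i} \right] = \sum_{i=1}^n \frac{x_{ij}x_{ij'}}{\alpha V_i}\left(\frac{d\mu_i}{d\eta_i}\right)^2.$$

The information matrix therefore takes the form

$$I(\beta) = x^{\mathsf T} W(\beta)\, x, \tag{6.13}$$

where $W$ is the diagonal matrix with elements

$$w_i = \frac{(d\mu_i/d\eta_i)^2}{\alpha V_i},$$

$i = 1, \ldots, n$.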

When $\alpha$ is unknown, it may be estimated using maximum likelihood or the method of moments estimator

$$\hat\alpha = \frac{1}{n - k - 1}\sum_{i=1}^n \frac{(Y_i - \hat\mu_i)^2}{V(\hat\mu_i)}, \tag{6.14}$$

where $\hat\mu_i = \mu_i(\hat\beta)$. Section 2.5 contained the justification for this estimator, which has the advantage of being, in general, a consistent estimator in a broader range of circumstances than the MLE. The method of moments approach is routinely used for normal and gamma data. As usual, there will be an efficiency loss when compared to the use of the MLE if the distribution underlying the derivation of the latter is “true.”

The use of (6.10) is appealing since it depends on only the first two moments so that consistency of $\hat\beta_n$ does not depend on the distribution of the data. Accurate asymptotic confidence interval coverage depends only on correct specification of the mean-variance relationship. Section 6.7 describes how the latter requirement may be relaxed.

If  the  score  is  of  the  form  (6.6),  that  is,  if  the  score  arises  from  an  exponential 
family,  it  is  not  necessary  to  have  a  mean  function  of  GLM  form  (i.e.,  a  linear 
predictor  on  some  scale).  So,  for  example,  the  nonlinear  models  considered  later 
in  the  chapter,  when  embedded  within  an  exponential  family,  also  share  consistency 
of  estimation  (so  long  as  regularity  conditions  are  satisfied). 
6.5.2 Computation


Computation is relatively straightforward for GLMs, since the form of a GLM yields a log-likelihood surface that is well behaved, for all but pathological datasets. In particular, a variant of the Newton-Raphson method (a generic method for root-finding), known as Fisher scoring, may be used to find the MLEs. We briefly digress to describe the Newton-Raphson method. Let $S(\beta)$ represent a $p \times 1$ vector of functions that are themselves functions of a $p \times 1$ vector $\beta$. We wish to find $\beta$ such that $S(\beta) = 0$. A first-order Taylor series expansion about $\beta^{(0)}$ gives

$$S(\beta) \approx S(\beta^{(0)}) + S'(\beta^{(0)})(\beta - \beta^{(0)}).$$

Setting the left-hand side to zero yields

$$\beta = \beta^{(0)} - S'(\beta^{(0)})^{-1} S(\beta^{(0)}).$$

The Newton-Raphson method iterates the step

$$\beta^{(t+1)} = \beta^{(t)} - S'(\beta^{(t)})^{-1} S(\beta^{(t)}),$$

for $t = 0, 1, 2, \ldots$ The Fisher scoring method is the Newton-Raphson method applied to the score equation, but with the derivative $S'(\beta)$ (the negative of the observed information) replaced by its expectation $\mathrm{E}[S'(\beta)] = -I(\beta)$ to give

$$\beta^{(t+1)} = \beta^{(t)} + I(\beta^{(t)})^{-1} S(\beta^{(t)}),$$

so that a new estimate is calculated based on the score and information evaluated at the previous estimate. Recall that for a GLM, $I(\beta) = x^{\mathsf T} W(\beta)\, x$. Using this form, and (6.12), we write

$$\beta^{(t+1)} = (x^{\mathsf T} W^{(t)} x)^{-1} x^{\mathsf T} W^{(t)} \left[ x\beta^{(t)} + (W^{(t)})^{-1} u^{(t)} \right] = (x^{\mathsf T} W^{(t)} x)^{-1} x^{\mathsf T} W^{(t)} z^{(t)},$$

where $u^{(t)}$ and $z^{(t)}$ are $n \times 1$ vectors with $i$th elements

$$u_i^{(t)} = \left.\frac{Y_i - \mu_i}{\alpha V_i}\frac{d\mu_i}{d\eta_i}\right|_{\beta^{(t)}} \quad \text{and} \quad z_i^{(t)} = x_i\beta^{(t)} + \left.(Y_i - \mu_i)\frac{d\eta_i}{d\mu_i}\right|_{\beta^{(t)}}, \tag{6.15}$$

respectively. The Fisher scoring updates (6.15) therefore have the form of a weighted least squares solution to

$$(z^{(t)} - x\beta)^{\mathsf T} W^{(t)} (z^{(t)} - x\beta) \tag{6.16}$$

with “working” or “adjusted” response $z^{(t)}$. This method is therefore known as iteratively reweighted least squares (IRLS). For canonical links, the observed and expected information coincide so that the Fisher scoring and Newton-Raphson methods are identical.

Table 6.2 Point and 90% interval estimates for the theophylline data of Table 1.2, under various models and estimation techniques. CV is the coefficient of variation and is expressed as a percentage. The Bayesian point estimates correspond to the posterior medians

Model               x_{1/2}             x_max               mu(x_max)           CV (x100)
GLM MLE             7.23 [6.89, 7.59]   1.60 [1.52, 1.69]   8.25 [7.95, 8.56]   4.38 [3.04, 6.33]
GLM sandwich        7.23 [6.97, 7.50]   1.60 [1.57, 1.64]   8.25 [8.02, 8.48]   4.38 [3.04, 6.33]
Nonlinear MLE       7.54 [7.09, 8.01]   1.51 [1.36, 1.66]   8.59 [7.99, 9.24]   6.32 [4.38, 9.13]
Nonlinear sandwich  7.54 [7.11, 7.98]   1.51 [1.43, 1.58]   8.59 [8.11, 9.10]   6.32 [4.38, 9.13]
Prior               8.00 [5.30, 12.0]   1.50 [0.75, 3.00]   9.00 [6.80, 12.0]   5.00 [2.50, 10.0]
GLM Bayes           7.26 [6.93, 7.74]   1.60 [1.51, 1.68]   8.24 [7.89, 8.54]   5.21 [3.72, 7.86]
Nonlinear Bayes     7.57 [7.15, 8.04]   1.50 [1.36, 1.66]   8.59 [8.22, 8.94]   6.01 [4.34, 8.93]

The  existence  and  uniqueness  of  estimates  have  been  considered  by  a  number  of 
authors;  early  references  are  Wedderburn  (1976)  and  Haberman  (1977). 
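The Fisher scoring recursion can be sketched in a few lines for one concrete case. The code below is an illustrative sketch only, assuming a Poisson GLM with a log link and a single covariate (so $\alpha = 1$, $V(\mu) = \mu$, $d\mu/d\eta = \mu$, and the weights are $w_i = \mu_i$); the data, the function name, and the fixed iteration count are all hypothetical choices rather than anything prescribed by the text.

```python
import math

def irls_poisson_log(x, y, n_iter=25):
    """IRLS for a Poisson GLM with log link: log(mu_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # crude starting values
    for _ in range(n_iter):                   # fixed number of sweeps, no stopping rule
        s00 = s01 = s11 = t0 = t1 = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu                    # weight (dmu/deta)^2 / V(mu), with alpha = 1
            z = eta + (yi - mu) / mu  # working response: eta + (y - mu) * deta/dmu
            s00 += w
            s01 += w * xi
            s11 += w * xi * xi
            t0 += w * z
            t1 += w * xi * z
        # Solve the 2 x 2 weighted least squares equations (x^T W x) b = x^T W z.
        det = s00 * s11 - s01 * s01
        b0, b1 = (s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det
    return b0, b1

# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 3, 4, 9, 15]
b0_hat, b1_hat = irls_poisson_log(x, y)
mu_hat = [math.exp(b0_hat + b1_hat * xi) for xi in x]

# With the canonical link the score equations recover the sufficient statistics.
score0 = sum(yi - mi for yi, mi in zip(y, mu_hat))
score1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu_hat))
print(b0_hat, b1_hat, score0, score1)
```

Since the log link is canonical for the Poisson model, this recursion is also the Newton-Raphson iteration; a production implementation would normally add a convergence check and step-halving.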


Example:  Pharmacokinetics  of  Theophylline 

Fitting the gamma model (6.8) with mean function (6.7) gives MLEs for $[\beta_0, \beta_1, \beta_2]$ of $[2.42, -0.0959, -0.246]$. The fitted curve is shown in Fig. 6.1. The method of moments estimate of the coefficient of variation, $100\sqrt{\hat\alpha}$, is 5.3%, while the MLE is 4.4%. Asymptotic standard errors for $[\beta_0, \beta_1, \beta_2]$, based on the method of moments estimator for $\alpha$, are $[0.033, 0.0028, 0.018]$. The point estimates of $\beta$ are identical, regardless of the estimate used for $\alpha$, because the root of the score is independent of $\alpha$ in a GLM, as is clear from (6.10).
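As a rough check, the derived quantities reported for the GLM can be approximately reproduced from the rounded MLEs. The sketch below rests on assumptions that are not spelled out in this section: the half-life is taken to be $-\log 2/\beta_1$ (from the terminal slope of the log curve), and the time of maximum concentration is obtained by setting the derivative of $\beta_1 x + \beta_2/x$ to zero, giving $\sqrt{\beta_2/\beta_1}$. Small discrepancies from the published values are expected because the inputs are rounded.

```python
import math

# Rounded MLEs reported in the text for model (6.7) with x2 = 1/x1.
beta0, beta1, beta2 = 2.42, -0.0959, -0.246

def mu(x):
    """Fitted mean concentration at time x > 0."""
    return math.exp(beta0 + beta1 * x + beta2 / x)

half_life = -math.log(2.0) / beta1   # assumed definition: terminal half-life
x_max = math.sqrt(beta2 / beta1)     # stationary point of beta1*x + beta2/x
mu_max = mu(x_max)                   # maximum fitted concentration

print(half_life, x_max, mu_max)
```

The three values can be compared with the GLM MLE row of Table 6.2.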

The top row of Table 6.2 gives MLEs for the derived parameters, along with asymptotic 90% confidence intervals, derived using the delta method. All are based upon the method of moments estimator for $\alpha$. The parameters of interest are all positive, and so the intervals were obtained on the log scale and then exponentiated. Deriving an interval estimate for the clearance parameter using the delta method is more complex. Working with $\theta = \log Cl$, we have

$$\widehat{\mathrm{var}}(\hat\theta) = [D_0 \;\; D_1 \;\; D_2]\; \hat V \; [D_0 \;\; D_1 \;\; D_2]^{\mathsf T},$$

where, from (6.5),


A>  = 

Di  = 

D2  = 


dO 

W 

80 

w 

dO 

w 


1 

1  /feA0(2Vftfe) 

a  V  & 

[faK0(2y/fcfc) 

V  &  Ri  (2VTO  ’ 


and $\hat V$ is the variance-covariance matrix of $\hat\beta$ as given by (6.11). For the theophylline data, the MLE is $\widehat{Cl} = 0.042$ with asymptotic 90% confidence interval [0.041, 0.044]. Inference for the clearance parameter using the sampling-based Bayesian approach that we describe shortly is straightforward, once samples are generated from the posterior.


Example:  Poisson  Data  with  a  Linear  Link 

We now describe a GLM that is a little more atypical and reveals some of the subtleties of modeling that can occur. In the context of a spatial study, suppose that, in a given time period, $Y_{i0}$ represents the number of counts of a (statistically) rare disease in an unexposed group of size $N_{i0}$, while $Y_{i1}$ represents the number of counts of a rare disease in an exposed group of size $N_{i1}$, all in area $i$, $i = 1, \ldots, n$. Suppose also that we only observe the sum of the disease counts, $Y_i = Y_{i0} + Y_{i1}$, along with $N_{i0}$ and $N_{i1}$. If we had observed $Y_{i0}$, $Y_{i1}$, we would fit the model $Y_{ij} \mid \beta^\star \sim_{ind} \mathrm{Poisson}(N_{ij}\beta_j^\star)$ so that $0 < \beta_j^\star < 1$ is the probability of disease in exposure group $j$, with $j = 0/1$ representing unexposed/exposed and $\beta^\star = [\beta_0^\star, \beta_1^\star]$. Then, writing $x_i = N_{i1}/N_i$ as the proportion of exposed individuals, the distribution of the total disease counts is

$$Y_i \mid \beta^\star \sim_{ind} \mathrm{Poisson}\left\{ N_i[(1 - x_i)\beta_0^\star + x_i\beta_1^\star] \right\}, \tag{6.17}$$

so that we have a Poisson GLM with a linear link function. Since the parameters $\beta_0^\star$ and $\beta_1^\star$ are the probabilities (or risks) of disease for unexposed and exposed individuals, respectively, a parameter of interest is the relative risk, $\beta_1^\star/\beta_0^\star$.

We illustrate the fitting of this model using data on the incidence of lip cancer in men in $n = 56$ counties of Scotland over the years 1975-1980. These data were originally reported by Kemp et al. (1985) and have been subsequently reanalyzed by numerous others; see, for example, Clayton and Kaldor (1987). The covariate $x_i$ is the proportion of individuals employed in agriculture, fishing, and forestry in county $i$. We let $Y_i$ represent the number of cases in county $i$. Model (6.17) requires some adjustment, since the only available data here, in addition to $x_i$, are the expected numbers $E_i$ that account for the age breakdown in county $i$ (see Sect. 1.3.3). We briefly describe the model development in this case, since it requires care and reveals assumptions that may otherwise be unapparent.

Let $Y_{ijk}$ be the number of cases, from a population of $N_{ijk}$, in county $i$, exposure group $j$, and age stratum $k$, $i = 1, \ldots, n$, $j = 0, 1$, $k = 1, \ldots, K$. An obvious starting model for a rare disease is

$$Y_{ijk} \mid p_{ijk} \sim_{ind} \mathrm{Poisson}(N_{ijk}\,p_{ijk}).$$

This model contains far too many parameters, $p_{ijk}$, to estimate, and so we simplify by assuming

$$p_{ijk} = \beta_j \times p_k, \tag{6.18}$$

across all areas $i$. Consequently, $p_k$ is the probability of disease in age stratum $k$ and $\beta_j > 0$ is the relative risk adjustment in exposure group $j$, and we are assuming that the exposure effect is the same across areas and across age strata. The age-specific probabilities $p_k$ are assumed known (e.g., being based on rates from a larger geographic region). The numbers of exposed individuals in each age stratum are unknown, and we therefore make the important assumption that the proportion of exposed and unexposed individuals is constant across age strata, that is, $N_{i0k} = N_{ik}(1 - x_i)$ and $N_{i1k} = N_{ik}x_i$. This assumption is made since $N_{i0k}$ and $N_{i1k}$ are unavailable and is distinct from assumption (6.18), which concerns the underlying disease model. Summing across strata and exposure groups gives

$$Y_i \mid \beta \sim_{ind} \mathrm{Poisson}\left\{ \beta_0(1 - x_i)\sum_{k=1}^K N_{ik}p_k + \beta_1 x_i \sum_{k=1}^K N_{ik}p_k \right\}.$$

Letting $E_i = \sum_{k=1}^K N_{ik}p_k$ represent the expected number of cases, and simplifying the resultant expression gives

$$Y_i \mid \beta \sim_{ind} \mathrm{Poisson}\left\{ E_i[(1 - x_i)\beta_0 + x_i\beta_1] \right\}. \tag{6.19}$$

Under this model,

$$\mathrm{E}\left[\frac{Y_i}{E_i} \,\Big|\, \beta\right] = \beta_0 + (\beta_1 - \beta_0)x_i, \tag{6.20}$$

illustrating that the mean model for the standardized morbidity ratio (SMR), $Y_i/E_i$, is linear in $x$. Figure 6.2 plots the SMRs $Y_i/E_i$ versus $x_i$, with a linear fit added, and we see evidence of increasing SMR with increasing $x$.
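The step from age-stratified populations to (6.19) and (6.20) can be traced with a small numerical sketch. The stratum populations and reference probabilities below are hypothetical; the coefficients 0.45 and 10.1 are used only as convenient illustrative values.

```python
def expected_count(N_ik, p_k):
    """E_i = sum_k N_ik * p_k, the expected number of cases in area i."""
    return sum(n * p for n, p in zip(N_ik, p_k))

def poisson_mean(E_i, x_i, beta0, beta1):
    """Mean of Y_i under (6.19): E_i * [(1 - x_i) * beta0 + x_i * beta1]."""
    return E_i * ((1.0 - x_i) * beta0 + x_i * beta1)

# Hypothetical area with three age strata.
N_ik = [1000, 2000, 500]       # stratum populations (assumed)
p_k = [0.001, 0.002, 0.01]     # reference disease probabilities (assumed)
x_i = 0.1                      # proportion exposed (assumed)

E_i = expected_count(N_ik, p_k)
mean_i = poisson_mean(E_i, x_i, beta0=0.45, beta1=10.1)
smr_mean = mean_i / E_i        # should equal beta0 + (beta1 - beta0) * x_i
print(E_i, mean_i, smr_mean)
```

The SMR mean depends on the area only through $x_i$, which is why a plot of SMR against the proportion exposed is a natural diagnostic for this model.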

Fitting the Poisson linear link model gives estimates (asymptotic standard errors) for $\beta_0$ and $\beta_1$ of 0.45 (0.043) and 10.1 (0.77). The fitted line (6.20) is superimposed on Fig. 6.2. The estimate of the relative risk $\beta_1/\beta_0$ is 22.7 with asymptotic standard error 3.39. The latter is a model-based estimate and in particular depends on there being no excess-Poisson variation, which is highly dubious for applications such as this, because of all of the missing auxiliary information, including data on smoking.

Fig. 6.2 Plot of standardized morbidity ratio versus proportion exposed for lip cancer incidence in 56 counties of Scotland. The linear model fit is indicated


6.5.3  Hypothesis  Testing 

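Suppose that $\dim(\beta) = k + 1$ and let $\beta = [\boldsymbol\beta_1, \boldsymbol\beta_2]$ be a partition with $\boldsymbol\beta_1 = [\beta_0, \ldots, \beta_q]$ and $\boldsymbol\beta_2 = [\beta_{q+1}, \ldots, \beta_k]$, with $0 \le q < k$. Interest focuses on testing whether the subset of $k - q$ parameters are equal to zero via a test of the null

$$H_0: \boldsymbol\beta_1 \text{ unrestricted},\ \boldsymbol\beta_2 = \boldsymbol\beta_{20} \quad \text{versus} \quad H_1: \beta = [\boldsymbol\beta_1, \boldsymbol\beta_2] \neq [\boldsymbol\beta_1, \boldsymbol\beta_{20}]. \tag{6.21}$$

As outlined in Sect. 2.9, there are three main frequentist approaches to hypothesis testing, based on Wald, score, and likelihood ratio tests. We concentrate on the latter. For the linear model, the equivalent approach is based on an F test (Sect. 5.6.1), which formally accounts for estimation of the scale parameter.

The log-likelihood is

$$l(\beta) = \sum_{i=1}^n \left[ \frac{y_i\theta_i - b(\theta_i)}{\alpha} + c(y_i, \alpha) \right],$$

with $\alpha$ the scale parameter. We let $\theta = \theta(\beta) = [\theta_1(\beta), \ldots, \theta_n(\beta)]$ denote the vector of canonical parameters. Under the null, from Sect. 2.9.5,

$$2\left[ l(\hat\beta) - l(\hat\beta^{(0)}) \right] \rightarrow_d \chi^2_{k-q},$$

where $\hat\beta$ is the unrestricted MLE and $\hat\beta^{(0)} = [\hat{\boldsymbol\beta}_{10}, \boldsymbol\beta_{20}]$ is the MLE under the null.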
In some circumstances, one may assess the overall fit of a particular model via comparison of the likelihood of this model with the maximum attainable log-likelihood, which occurs under the saturated model. We write $\tilde\theta = [\tilde\theta_1, \ldots, \tilde\theta_n]$ to represent the MLEs under the saturated model. Similarly, let $\hat\theta = [\hat\theta_1, \ldots, \hat\theta_n]$ denote the MLEs under a reduced model containing $q + 1$ parameters. The log-likelihood ratio statistic of $H_0$: reduced model, $H_1$: saturated model is

$$2\left[ l(\tilde\theta) - l(\hat\theta) \right] = \frac{2}{\alpha}\sum_{i=1}^n \left[ Y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i) \right] = \frac{D}{\alpha}, \tag{6.22}$$

where $D$ is known as the deviance (associated with the saturated model) and $D/\alpha$ is the scaled deviance. If the saturated model has a fixed number of parameters, $p$, then, under the reduced model,

$$\frac{D}{\alpha} \rightarrow_d \chi^2_{p-q-1}.$$

In general, this result is rarely used, though cross-classified discrete data provide one instance in which the overall fit of a model can be assessed in this way. An alternative measure of the overall fit is the Pearson statistic

$$X^2 = \sum_{i=1}^n \frac{(Y_i - \hat\mu_i)^2}{V(\hat\mu_i)}, \tag{6.23}$$

with $X^2 \rightarrow_d \chi^2_{p-q-1}$ under the null. Again, the saturated model should contain a fixed number of parameters (as $n \rightarrow \infty$).
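For a concrete instance of (6.22) and (6.23), consider the Poisson case, where $\alpha = 1$, $\theta_i = \log\mu_i$, $b(\theta_i) = \exp(\theta_i)$, and the saturated fit is $\tilde\mu_i = y_i$. The sketch below evaluates the resulting deviance and the Pearson statistic for hypothetical observed and fitted counts; the function names are illustrative.

```python
import math

def poisson_deviance(y, mu):
    """Deviance 2 * sum[ y*log(y/mu) - (y - mu) ], taking y*log(y) = 0 when y = 0."""
    d = 0.0
    for yi, mi in zip(y, mu):
        term = mi - yi
        if yi > 0:
            term += yi * math.log(yi / mi)
        d += term
    return 2.0 * d

def pearson_statistic(y, mu, V=lambda m: m):
    """Pearson statistic sum (y - mu)^2 / V(mu); default V is the Poisson variance function."""
    return sum((yi - mi) ** 2 / V(mi) for yi, mi in zip(y, mu))

# Hypothetical observed counts and fitted means.
y = [2, 0, 5]
mu_hat = [2.0, 0.5, 4.0]
dev = poisson_deviance(y, mu_hat)
x2 = pearson_statistic(y, mu_hat)
print(dev, x2)
```

Both statistics are zero when the fitted means reproduce the data exactly, and they typically take similar values when the fitted means are not too small.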

Consider again the nested testing situation with hypotheses (6.21). We describe an attractive additivity property of the likelihood ratio test statistic for nested models. Let $\hat\beta^{(0)}$, $\hat\beta^{(1)}$, and $\hat\beta^{(s)}$ represent the MLEs of $\beta$ under the null, alternative, and saturated models, respectively. Suppose that the dimensionality of $\beta^{(j)}$ is $q_j$ with $0 < q_0 < q_1 < p$. Under $H_0$,

$$2\left[ l(\hat\beta^{(1)}) - l(\hat\beta^{(0)}) \right] = 2\left\{ \left[ l(\hat\beta^{(s)}) - l(\hat\beta^{(0)}) \right] - \left[ l(\hat\beta^{(s)}) - l(\hat\beta^{(1)}) \right] \right\} = \frac{1}{\alpha}(D_0 - D_1) \rightarrow_d \chi^2_{q_1 - q_0},$$

where $D_j$ is the deviance representing the fit under hypothesis $j$, relative to the saturated model, $j = 0, 1$. The Pearson statistic does not share this additivity property.

For a GLM, in contrast to the linear model (see Sect. 5.8), even if a covariate is orthogonal to all other covariates, its significance will still depend on which covariates are currently in the model.


Example:  Normal  Linear  Model 


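We consider the model $Y \mid \beta \sim \mathrm{N}_n(x\beta, \sigma^2 I_n)$. The log-likelihood is

$$l(\beta, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}(y - x\beta)^{\mathsf T}(y - x\beta),$$

with $\alpha$ in the GLM formulation being replaced by $\sigma^2$. Again, let $\beta = [\boldsymbol\beta_1, \boldsymbol\beta_2]$ where $\boldsymbol\beta_1 = [\beta_0, \ldots, \beta_q]$ and $\boldsymbol\beta_2 = [\beta_{q+1}, \ldots, \beta_k]$, and consider the null $H_0$: $\boldsymbol\beta_1$ unrestricted, $\boldsymbol\beta_2 = \boldsymbol\beta_{20}$. Under this null, from (6.22),

$$D = \sum_{i=1}^n \left( Y_i - x_i\hat\beta^{(0)} \right)^2,$$

where $x_i\hat\beta^{(0)}$ are the fitted values for the $i$th case, based on the MLEs under the reduced model, $H_0$. In this case, the asymptotic distribution is exact since

$$\frac{\sum_{i=1}^n \left( Y_i - x_i\hat\beta^{(0)} \right)^2}{\sigma^2} \sim \chi^2_{n-q-1}. \tag{6.24}$$

This result is almost never directly useful, however, since $\sigma^2$ is rarely known.

In terms of comparing the nested hypotheses $H_0$: $\boldsymbol\beta_1$ unrestricted, $\boldsymbol\beta_2 = \boldsymbol\beta_{20}$, and $H_1$: $\beta = [\boldsymbol\beta_1, \boldsymbol\beta_2] \neq [\boldsymbol\beta_1, \boldsymbol\beta_{20}]$, the likelihood ratio statistic is

$$\frac{1}{\sigma^2}(D_0 - D_1) = \frac{1}{\sigma^2}\left[ \sum_{i=1}^n \left( Y_i - x_i\hat\beta^{(0)} \right)^2 - \sum_{i=1}^n \left( Y_i - x_i\hat\beta^{(1)} \right)^2 \right] = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\sigma^2} = \frac{\mathrm{FSS}_{01}}{\sigma^2}, \tag{6.25}$$

where $x\hat\beta^{(j)}$ are the fitted values corresponding to the MLEs under model $j$, $\mathrm{RSS}_j$ is the residual sum of squares for model $j$, $j = 0, 1$, and $\mathrm{FSS}_{01}$ is the fitted sum of squares due to the additional parameters present in $H_1$.

In practice if $n$ is large, we may use (6.25) with $\sigma^2$ replaced by a consistent estimator $\hat\sigma^2$. Alternatively, the ratios of scaled versions of (6.25) and (6.24) may be taken to produce an F-statistic by which statistical significance may be assessed, as described in Sect. 5.6.1.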
Example: Lung Cancer and Radon

Under a Poisson model, the deviance and scaled deviance are identical since $\alpha = 1$. For a Poisson model with MLE $\hat\beta$, the deviance is

$$D = 2\sum_{i=1}^n \left[ \mu_i(\hat\beta) - y_i + y_i \log\frac{y_i}{\mu_i(\hat\beta)} \right],$$

and if the sums of the observed and fitted counts agree, then we obtain the intuitive distance measure

$$2\sum_{i=1}^n y_i \log\frac{y_i}{\mu_i(\hat\beta)}.$$

For the Minnesota data, suppose we wish to test $H_0$: $\beta_0$ unrestricted, $\beta_1 = 0$ versus $H_1$: $[\beta_0, \beta_1] \neq [\beta_0, 0]$, in the model $\mu_i = E_i\exp(\beta_0 + \beta_1 x_i)$. The likelihood ratio statistic is

$$T = 2\sum_{i=1}^n y_i \log\frac{\mu_i(\hat\beta^{(1)})}{\mu_i(\hat\beta^{(0)})},$$

since $\sum_{i=1}^n \mu_i(\hat\beta^{(0)}) = \sum_{i=1}^n \mu_i(\hat\beta^{(1)})$, and where $\hat\beta^{(0)}$ and $\hat\beta^{(1)}$ are the MLEs under the null and alternative hypotheses. Under $H_0$, $T \rightarrow_d \chi^2_1$.

For the Minnesota data $T = 46.2$, to give an extremely small p-value. The estimate (standard error) of $\beta_1$ is $-0.036$ (0.0054) so that for a one-unit increase in average radon, there is an associated drop in relative risk of lung cancer of 3.6%.


6.6  Quasi-likelihood  Inference  for  GLMs 

Section 2.5 provided an extended discussion of quasi-likelihood, and here we recap the key points. GLMs that do not contain a scale parameter are particularly vulnerable to variance model misspecification, specifically the presence of overdispersion in the data. The Poisson and binomial models are especially susceptible in this respect.

Rather than specify a complete probability model for the data, quasi-likelihood proceeds by specifying the mean and variance as

$$\mathrm{E}[Y_i \mid \beta] = \mu_i(\beta),$$
$$\mathrm{var}(Y_i \mid \beta) = \alpha V(\mu_i),$$

with $\mathrm{cov}(Y_i, Y_j \mid \beta) = 0$ for $i \neq j$. From these specifications, the quasi-score is defined as in (2.30) and coincides with the score function (6.10). Hence, the maximum quasi-likelihood estimator $\hat\beta$ is identical to the MLE due to the multiplicative form of the variance model. Estimation of $\alpha$ may be carried out using the form (6.14) or via

$$\hat\alpha = \frac{D}{n - k - 1},$$

where $D$ is the deviance and $\dim(\beta) = k + 1$. Asymptotic inference is based on

$$(D^{\mathsf T} V^{-1} D/\alpha)^{1/2}(\hat\beta_n - \beta) \rightarrow_d \mathrm{N}_{k+1}(0, I_{k+1}).$$

In practice, $D$ and $V$ are evaluated at $\hat\beta_n$, and $\hat\alpha$ replaces $\alpha$.

Hypothesis tests follow in an obvious fashion, with adjustment for $\alpha$. Specifically, if as before

$$l(\beta, \alpha) = \sum_{i=1}^n \int_{y_i}^{\mu_i} \frac{y_i - t}{\alpha V(t)}\,dt,$$

then if $l(\beta) = l(\beta, \alpha = 1)$ represents the likelihood upon which the quasi-likelihood is based (e.g., a Poisson or binomial likelihood),

$$l(\beta) = l(\beta, \alpha) \times \alpha \tag{6.26}$$

and to test $H_0$: $\boldsymbol\beta_1$ unrestricted, $\boldsymbol\beta_2 = \boldsymbol\beta_{20}$, we may use the quasi-likelihood ratio test statistic

$$2\left[ l(\hat\beta, \hat\alpha) - l(\hat\beta^{(0)}, \hat\alpha) \right] \rightarrow_d \chi^2_{k-q},$$

or equivalently

$$\frac{2\left[ l(\hat\beta) - l(\hat\beta^{(0)}) \right]}{\hat\alpha} \rightarrow_d \chi^2_{k-q}. \tag{6.27}$$

If, as is usually the case, $\hat\alpha > 1$, then larger differences in the log-likelihood are required to attain the same level of significance, as compared to the $\alpha = 1$ case.


Example:  Lung  Cancer  and  Radon 

Fitting the quasi-likelihood model

$$\mathrm{E}[Y_i \mid \beta] = E_i\exp(\beta_0 + \beta_1 x_i) \tag{6.28}$$

$$\mathrm{var}(Y_i \mid \beta) = \alpha\,\mathrm{E}[Y_i \mid \beta], \tag{6.29}$$

yields identical point estimates for $\beta$ to the Poisson model, with scale parameter estimate $\hat\alpha = 2.81$, obtained via (6.14). Therefore, with respect to $H_0$: $\beta_0$ unrestricted, $\beta_1 = 0$, the quasi log-likelihood ratio statistic is $46.2/2.81 = 16.5$ so that the significance level is vastly reduced, though still strongly suggestive of a nonzero slope.
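The effect of the scale adjustment on significance can be quantified with the $\chi^2_1$ reference distribution. The sketch below uses the identity $\Pr(\chi^2_1 > t) = \mathrm{erfc}(\sqrt{t/2})$ together with the rounded values 46.2 and 2.81 quoted in the text; the quotient computed from these rounded inputs is about 16.4, which differs slightly from the reported 16.5, presumably because the latter was computed from unrounded quantities.

```python
import math

def chi2_1_sf(t):
    """Upper tail probability P(chi^2_1 > t), via P(|Z| > sqrt(t)) for standard normal Z."""
    return math.erfc(math.sqrt(t / 2.0))

T = 46.2            # Poisson likelihood ratio statistic from the text
alpha_hat = 2.81    # method of moments scale estimate from the text
T_quasi = T / alpha_hat

p_poisson = chi2_1_sf(T)
p_quasi = chi2_1_sf(T_quasi)
print(T_quasi, p_poisson, p_quasi)
```

The quasi-likelihood p-value is many orders of magnitude larger than the Poisson one, yet still far below conventional significance thresholds.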
6.7 Sandwich Estimation for GLMs

The asymptotic variance-covariance matrix for $\hat\beta$, which is given by (6.11), is appropriate only if the first two moments are correctly specified. In general, as detailed in Sect. 2.6, $\mathrm{var}(\hat\beta) = A^{-1}B(A^{\mathsf T})^{-1}$ where

$$A = \mathrm{E}\left[\frac{\partial G}{\partial \beta}\right] = D^{\mathsf T} V^{-1} D, \tag{6.30}$$

regardless of the distribution of the data (so long as the mean is correctly specified), and

$$B = \mathrm{var}[G(\beta)] = D^{\mathsf T} V^{-1}\,\mathrm{var}(Y)\,V^{-1} D,$$

where $G(\beta) = S(\beta)/n$. Under the assumption of uncorrelated errors,

$$\hat B = \sum_{i=1}^n \left(\frac{\partial \mu_i}{\partial \beta}\right)^{\mathsf T} \frac{\widehat{\mathrm{var}}(Y_i)}{V_i^2} \left(\frac{\partial \mu_i}{\partial \beta}\right), \tag{6.31}$$

where a naive estimator of $\mathrm{var}(Y_i)$ is

$$\hat\sigma_i^2 = (Y_i - \hat\mu_i)^2, \tag{6.32}$$

which has finite sample bias. Combination of (6.31) and (6.32) provides a consistent estimator of the variance and therefore asymptotically corrects confidence interval coverage (so long as independence of responses holds).

Bootstrap methods (Sect. 2.7.2) may also be used to provide inference that is robust to certain aspects of model misspecification, provided $n$ is sufficiently large. The resampling residuals method may be applied, but the meaning of residuals is ambiguous in GLMs (Sect. 6.9), and this method does not correct for mean-variance misspecification, which is a major drawback. The resampling cases approach corrects for this aspect. Davison and Hinkley (1997, Sect. 7.2) discuss both resampling residuals and resampling cases in the context of GLMs.

Example:  Pharmacokinetics  of  Theophylline 

Table 6.2 gives confidence intervals for $x_{1/2}$, $x_{\max}$, and $\mu(x_{\max})$, based on sandwich estimation. In each case, the interval estimates are a little shorter than the model-based estimates. This could be due to either instability in the sandwich estimates with a small sample size ($n = 10$) or to the gamma mean-variance assumption being inappropriate.

6.8 Bayesian Inference for GLMs

We now consider Bayesian inference for the GLM. The posterior is

$$p(\beta, \alpha \mid y) \propto l(\beta, \alpha)\,\pi(\beta, \alpha),$$

where it is usual to assume prior independence between the regression coefficients $\beta$ and the scale parameter $\alpha$, that is, $\pi(\beta, \alpha) = \pi(\beta)\pi(\alpha)$.


6.8.1  Prior  Specification 


Recall that $\beta = [\beta_0, \beta_1, \ldots, \beta_k]$. Often, $\beta_j$, $j = 0, 1, \ldots, k$, is defined on $\mathbb{R}$, and so a multivariate normal prior for $\beta$ is the obvious choice. Furthermore, independent priors are frequently defined for each component. As a limiting case, the improper prior $\pi(\beta) \propto 1$ results. However, care should be taken with this choice since it may lead to an improper posterior. With canonical links, impropriety only occurs for pathological datasets (see the binomial model example of Sect. 3.4), but for noncanonical links, innocuous datasets may lead to impropriety, as the Poisson data with a linear link example described below illustrates. If the scale parameter $\alpha > 0$ is unknown, gamma or lognormal distributions provide obvious choices.


Poisson  Data  with  a  Linear  Link 

Recall the Poisson model with a linear link function

$$Y_i \mid \beta \sim_{\text{ind}} \text{Poisson}\{E_i[(1 - x_i)\beta_0 + x_i\beta_1]\}$$

and suppose we assume an improper uniform prior for $\beta_0 > 0$, that is,

$$\pi(\beta_0) \propto 1.$$

We define $e^{\gamma} = \beta_1/\beta_0 > 0$ as the parameter of interest and write

$$\mu_i = \beta_0 E_i[(1 - x_i) + x_i\exp(\gamma)] = \beta_0\mu_i^{\star}.$$

The marginal posterior for $\gamma$ is

$$p(\gamma \mid y) \propto \pi(\gamma)\prod_{i=1}^n (\mu_i^{\star})^{y_i}\int_0^{\infty}\beta_0^{y_+}\exp\left(-\beta_0\sum_{i=1}^n \mu_i^{\star}\right)d\beta_0 \qquad (6.33)$$

$$= l(\gamma) \times \pi(\gamma), \qquad (6.34)$$

where $y_+ = \sum_{i=1}^n y_i$ and the last line follows from the previous on recognizing that the integrand is the
kernel of a $\text{Ga}(y_+ + 1, \sum_{i=1}^n \mu_i^{\star})$ distribution. The "likelihood," $l(\gamma)$ in (6.34),
is of multinomial form with the total number of cases $y_+$ distributed among the $n$
areas with probabilities proportional to $E_i[(1 - x_i) + x_i\exp(\gamma)]$ so that, for example,
larger $E_i$ and larger $x_i$ (if $\gamma > 0$) lead to a larger allocation of cases to area $i$. The
likelihood contribution to the posterior tends to the constant


(6.35) 


as $\gamma \to -\infty$, showing that, in general, a proper prior is required (since the tail will
be non-integrable). The constant (6.35) is nonzero unless $x_i = 1$ in any area with
$y_i \neq 0$. The reason for the impropriety is that in the limit as $\gamma \to -\infty$, the relative
risk $\exp(\gamma) \to 0$ so that exposed individuals cannot get the disease, which is not
inconsistent with the observed data, unless all individuals in area $i$ are exposed,
$x_i = 1$, and $y_i \neq 0$ in that area since then clearly (under the assumed model) the
cases are due to exposure. A similar argument holds as $\gamma \to \infty$, with replacement
of $1 - x_i$ by $x_i$ in (6.35) providing the limiting constant.

Figure 6.3 illustrates this behavior for the Scottish lip cancer example, for which
$x_i = 0$ in five areas. The log-likelihood has been scaled to have maximum 0, and the
constant (6.35) is indicated with a dashed horizontal line. The MLE $\hat{\gamma} = \log(22.7)$
is indicated as a vertical dotted line.
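
The flattening of the log-likelihood as the log relative risk decreases can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the multinomial kernel with probabilities proportional to E_i[(1 - x_i) + x_i exp(gamma)], ignores any further factor arising from the integration over beta_0, and runs on hypothetical toy data (the values of `E`, `x`, and `y` are invented and are not the Scottish lip cancer data). The function names are likewise hypothetical.

```python
import math

def log_lik_gamma(gamma, E, x, y):
    """Multinomial-form log-likelihood for gamma, up to an additive constant."""
    mu_star = [Ei * ((1.0 - xi) + xi * math.exp(gamma)) for Ei, xi in zip(E, x)]
    total = sum(mu_star)
    return sum(yi * math.log(m / total) for yi, m in zip(y, mu_star) if yi > 0)

def limit_minus_infinity(E, x, y):
    """Limit of log_lik_gamma as gamma -> -infinity.

    Finite only if x_i < 1 in every area with y_i > 0.
    """
    base = [Ei * (1.0 - xi) for Ei, xi in zip(E, x)]
    total = sum(base)
    return sum(yi * math.log(b / total) for yi, b in zip(y, base) if yi > 0)

# Hypothetical toy data (expected numbers, exposure proportions, counts).
E = [10.0, 20.0, 15.0, 5.0]
x = [0.0, 0.1, 0.2, 0.5]
y = [8, 25, 30, 12]

# The gap to the limiting constant shrinks as gamma decreases.
for g in (-2.0, -5.0, -10.0, -30.0):
    print(g, log_lik_gamma(g, E, x, y) - limit_minus_infinity(E, x, y))
```

Because the log-likelihood levels off at a finite value rather than falling to minus infinity, a flat prior on gamma leaves a non-integrable left tail, which is the impropriety described above.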


6.8.2  Computation 

Unfortunately, when continuous covariates are present in the model, conjugate
analysis is unavailable. However, sampling-based approaches are relatively easy to
implement. In particular, if informative priors are available, then the rejection
algorithm of Sect. 3.7.6 is straightforward to implement with sampling from the prior.
MCMC (Sect. 3.8) is obviously a candidate for computation and was illustrated for
Poisson and negative binomial models in Chap. 3. The INLA method described in
Sect. 3.7.4 may also be used.

Fig. 6.3 Log-likelihood for the log relative risk parameter $\gamma$, for the Scottish lip cancer
data. The dashed horizontal line is the constant to which the log-likelihood tends as
$\gamma \to -\infty$
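
A rejection algorithm with sampling from the prior can be sketched in a few lines. The example below is a minimal illustration, assuming the standard form of the algorithm in which a prior draw is accepted with probability equal to the likelihood divided by its maximum; the one-parameter Poisson model, the normal prior, the data, and all function names are hypothetical choices made for this sketch, not taken from the text.

```python
import math
import random

def log_lik(beta, E, y):
    """Poisson log-likelihood, up to a constant, with means E_i * exp(beta)."""
    return sum(yi * beta - Ei * math.exp(beta) for Ei, yi in zip(E, y))

def rejection_sample(E, y, prior_mean, prior_sd, n_draws, rng):
    """Draw beta from a normal prior; keep it with probability l(beta) / l(beta_hat)."""
    beta_hat = math.log(sum(y) / sum(E))   # closed-form MLE for this toy model
    log_lik_max = log_lik(beta_hat, E, y)
    kept = []
    for _ in range(n_draws):
        beta = rng.gauss(prior_mean, prior_sd)
        u = 1.0 - rng.random()             # uniform on (0, 1], so log(u) is defined
        if math.log(u) < log_lik(beta, E, y) - log_lik_max:
            kept.append(beta)
    return kept

rng = random.Random(1)
draws = rejection_sample([10.0, 20.0, 30.0], [12, 18, 33], 0.0, 1.0, 20000, rng)
print(len(draws), sum(draws) / len(draws))  # number accepted and posterior mean estimate
```

The accepted draws are independent samples from the posterior. The acceptance rate falls as the prior becomes more diffuse relative to the likelihood, which is why the approach is most attractive when informative priors are available.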

As described in Sect. 3.3, there is asymptotic equivalence between the sampling
distribution of the MLE and the posterior distribution. Hence, Bayes estimators for
$\beta$ are consistent due to the form of the likelihood, so long as the priors are nonzero
in a neighborhood of the true values of $\beta$.


6.8.3  Hypothesis  Testing 

A simple method for examining hypotheses involving a single parameter, $H_0: \beta_j = 0$
versus $H_1: \beta_j \neq 0$, with any remaining parameters unrestricted, is to evaluate
the posterior tail probability $\Pr(\beta_j > 0 \mid y)$, with values close to 0 or 1 indicating
that the null is unlikely to be true. Bayes factors (which were discussed in Sects. 3.10
and 4.3) provide a more general tool for comparing hypotheses (by analogy with the
likelihood ratio statistic, though of course, as usual, interpretation is very different):

$$\text{BF} = \frac{p(y \mid H_0)}{p(y \mid H_1)}.$$

The use of Bayes factors will be illustrated in Sect. 6.16.3. As discussed in
Sect. 4.3.2, great care is required in the specification of priors when model
comparison is carried out using Bayes factors.
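
Both summaries are simple to compute once posterior draws or marginal likelihoods are in hand. The sketch below assumes that draws of a single coefficient are already available (for example from MCMC or a rejection algorithm) and that log marginal likelihoods under each hypothesis have been obtained separately; the function names and the numbers in the usage lines are hypothetical.

```python
import math

def posterior_tail_probability(samples):
    """Monte Carlo estimate of Pr(beta_j > 0 | y) from posterior draws of beta_j."""
    return sum(1 for s in samples if s > 0.0) / len(samples)

def bayes_factor(log_marginal_h0, log_marginal_h1):
    """BF = p(y | H0) / p(y | H1), computed from log marginal likelihoods."""
    return math.exp(log_marginal_h0 - log_marginal_h1)

# Hypothetical posterior draws of beta_j.
print(posterior_tail_probability([-1.0, 0.5, 2.0, 3.0]))
# Hypothetical marginal likelihoods p(y | H0) = 0.2 and p(y | H1) = 0.1.
print(bayes_factor(math.log(0.2), math.log(0.1)))
```

Working on the log scale avoids underflow, since marginal likelihoods are often extremely small numbers.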

6.8.4 Overdispersed GLMs

Quasi-likelihood  provides  a  simple  procedure  by  which  frequentist  inference  may 
accommodate  overdispersion  in  GLMs.  No  such  simple  remedy  exists  within  the 
Bayesian  framework.  An  alternative  method  of  increasing  the  flexibility  of  GLMs 
is  through  the  introduction  of  random  effects.  We  have  already  seen  an  example  of 
this  in  Sect.  2.5  when  the  negative  binomial  model  was  derived  via  the  introduction 
of  gamma  random  effects  into  a  Poisson  model. 


Example:  Lung  Cancer  and  Radon 

The Bayesian Poisson model was fitted in Chap. 3 using a Metropolis-Hastings
implementation. Here the use of the INLA method of Sect. 3.7.4, with improper
flat priors on $\beta_0$, $\beta_1$, gives a 95% interval estimate for the relative risk $\exp(\beta_1)$ of
[0.954, 0.975], which is identical to that based on asymptotic likelihood inference
(the posterior mean and MLE of $\beta_1$ both equal $-0.036$, and the posterior standard
deviation and standard error both equal 0.0054).


Example:  Pharmacokinetics  of  Theophylline 

With respect to the gamma GLM with $\mu(x) = \exp(\beta_0 + \beta_1 x + \beta_2/x)$, the
interpretation of $\beta_0$ and $\beta_2$ in particular is not straightforward, which makes
prior specification difficult. As an alternative, we specify prior distributions on
the half-life $x_{1/2}$, time to maximum $x_{\max}$, maximum concentration $\mu(x_{\max})$, and
coefficient of variation, $\sqrt{\alpha}$. We choose independent lognormal priors for these four
parameters. For a generic parameter $\theta$, denote the prior by $\theta \sim \text{LogNorm}(\mu, \sigma)$. To
obtain the moments of these distributions, we specify the prior median $\theta_m$ and the
95% point of the prior $\theta_u$. We then solve for the moments via

$$\mu = \log(\theta_m), \qquad \sigma = \frac{\log(\theta_u) - \mu}{1.645}, \qquad (6.36)$$

as described in Sect. 3.4.2. Based on a literature search, we assume prior 50% (95%)
points of 8 (12), 1.5 (3), and 9 (12) for $x_{1/2}$, $x_{\max}$, and $\mu(x_{\max})$, respectively. For
the coefficient of variation, the corresponding values are 0.05 (0.10). The third line
of Table 6.2 summarizes these priors. To examine the posterior, we use a rejection
algorithm, as described in Sect. 3.7.6. We sample from the prior on the parameters
of interest and then back-solve for the parameters that describe the likelihood. For
the loglinear model, the transformation to $\beta$ is via

$$\beta_1 = -\frac{\log 2}{x_{1/2}}, \qquad \beta_2 = \beta_1 x_{\max}^2, \qquad \beta_0 = \log\mu(x_{\max}) + 2(\beta_1\beta_2)^{1/2}.$$

Fig. 6.4 Histogram representations of posterior distributions from the GLM for the theophylline
data for (a) half-life, (b) time to maximum, (c) maximum concentration, and (d) coefficient of
variation, with priors superimposed as solid lines
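
The prior construction in (6.36) and the back-solving step can be sketched as follows. This is an illustration under stated assumptions: the relations $x_{1/2} = -\log 2/\beta_1$ and $x_{\max} = (\beta_2/\beta_1)^{1/2}$ with $\beta_1, \beta_2 < 0$ are assumed for the loglinear mean $\exp(\beta_0 + \beta_1 x + \beta_2/x)$, and the function names are hypothetical.

```python
import math
import random

def lognormal_params(median, upper95):
    """Return (mu, sigma) of a lognormal with the given median and 95% point, as in (6.36)."""
    mu = math.log(median)
    sigma = (math.log(upper95) - mu) / 1.645
    return mu, sigma

def back_solve_beta(half_life, x_max, mu_max):
    """Map (x_1/2, x_max, mu(x_max)) to (beta0, beta1, beta2) for
    log mu(x) = beta0 + beta1 * x + beta2 / x, assuming beta1, beta2 < 0."""
    beta1 = -math.log(2.0) / half_life
    beta2 = beta1 * x_max ** 2
    beta0 = math.log(mu_max) + 2.0 * math.sqrt(beta1 * beta2)
    return beta0, beta1, beta2

# Prior 50% (95%) points quoted in the text: 8 (12), 1.5 (3), and 9 (12).
priors = [lognormal_params(8.0, 12.0), lognormal_params(1.5, 3.0), lognormal_params(9.0, 12.0)]

# One draw from the prior on the parameters of interest, mapped to beta.
rng = random.Random(0)
half_life, x_max, mu_max = (rng.lognormvariate(m, s) for m, s in priors)
print(back_solve_beta(half_life, x_max, mu_max))
```

In a rejection algorithm, each such prior draw would be mapped to $\beta$, combined with a draw of the coefficient of variation, and accepted or rejected according to the gamma likelihood.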

Table  6.2  summarizes  inference  for  the  parameters  of  interest,  via  medians  and  90% 
interval  estimates.  Point  and  interval  estimates  show  close  correspondence  with  the 
frequentist  summaries.  Figure  6.4  gives  the  posterior  distributions  for  the  half-life, 
the  time  to  maximum  concentration,  the  maximum  concentration,  and  the  coefficient 
of  variation  (expressed  as  a  percentage).  The  prior  distributions  are  also  indicated 
as  solid  curves.  We  see  some  skewness  in  each  of  the  posteriors,  which  is  common 
for  nonlinear  parameters  unless  the  data  are  abundant. 

Inference for the clearance parameter is relatively straightforward, since one
simply substitutes samples for $\beta$ into (6.5). Figure 6.5 gives a histogram
representation of the posterior distribution. The posterior median of the clearance is 0.042
with 90% interval [0.041, 0.044]; these summaries are identical to the likelihood-based
counterparts. We see that the posterior shows little skewness; the clearance
parameter is often found to be well behaved, since it is a function of the area under
the curve, which is reliably estimated so long as the tail of the curve is captured.

Fig. 6.5 Posterior distribution of the clearance parameter from the GLM
fitted to the theophylline data


6.9  Assessment  of  Assumptions  for  GLMs 

The  assessment  of  assumptions  for  GLMs  is  more  difficult  than  with  linear  models. 
The  definition  of  a  residual  is  more  ambiguous,  and  for  discrete  data  in  particular,  the 
interpretation  of  residuals  is  far  more  difficult,  even  when  the  model  is  correct. 
Various  attempts  have  been  made  to  provide  a  general  definition  of  residuals  that 
possess  zero  mean,  constant  variance,  and  a  symmetric  distribution.  In  general,  the 
latter  two  desiderata  are  in  conflict. 

When  first  examining  the  data,  one  may  plot  the  response,  transformed  to 
the  linear  predictor  scale,  against  covariates.  For  example,  with  Poisson  data  and 
canonical  log  link,  one  may  plot  log  y  versus  covariates  x. 

The obvious definition of a residual is

$$e_i = Y_i - \hat{\mu}_i,$$

but clearly in a GLM, such residuals will generally have unequal variances so
that some form of standardization is required. Pearson residuals, upon which we
concentrate, are defined as

$$e_i^{\star} = \frac{Y_i - \hat{\mu}_i}{\sqrt{\widehat{\text{var}}(Y_i)}},$$

where $\widehat{\text{var}}(Y_i) = \hat{\alpha}V(\hat{\mu}_i)$ and $\hat{\mu}_i$ are the fitted values from the model. Squaring and
summing these residuals reproduces Pearson's $\chi^2$ statistic:

$$X^2 = \sum_{i=1}^n e_i^{\star 2},$$

as previously introduced, (6.23). For Pearson residuals, $E[e_i^{\star}] = 0$ and $E[e_i^{\star 2}] = 1$,
but the third moment is not equal to zero in general so that the residuals are skewed.
As an example, for Poisson data, $E[e_i^{\star 3}] = \mu_i^{-1/2}$. Clearly for normal data, Pearson
residuals have zero skewness.

Deviance residuals are given by

$$e_i^{\star} = \text{sign}(Y_i - \hat{\mu}_i)\sqrt{D_i},$$

so that $D = \sum_{i=1}^n e_i^{\star 2}$, as defined in Sect. 6.5.3. As an example, for a Poisson
likelihood, the deviance residuals are

$$e_i^{\star} = \text{sign}(y_i - \hat{\mu}_i)\{2[y_i\log(y_i/\hat{\mu}_i) - y_i + \hat{\mu}_i]\}^{1/2}.$$

For  discrete  data  with  small  means,  residuals  are  extremely  difficult  to  interpret 
since  the  response  can  only  take  on  a  small  number  of  discrete  values.  One  strategy 
to  aid  in  interpretation  is  to  simulate  data  with  the  same  design  (i.e.,  x  values)  and 
under  the  parameter  estimates  from  the  fitted  model.  One  may  then  examine  residual 
plots  to  see  their  form  when  the  model  is  known. 
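
The two residual definitions for Poisson data can be written out directly. The sketch below assumes the Poisson variance function $V(\mu) = \mu$ with scale parameter fixed at 1 and uses the convention $y\log(y/\mu) = 0$ when $y = 0$; the function names and the fitted mean in the usage lines are hypothetical.

```python
import math

def pearson_residual(y, mu):
    """Pearson residual for Poisson data: (y - mu) / sqrt(mu)."""
    return (y - mu) / math.sqrt(mu)

def deviance_residual(y, mu):
    """Signed square root of the Poisson deviance contribution."""
    term = y * math.log(y / mu) if y > 0 else 0.0   # y log(y / mu) is taken as 0 at y = 0
    d = 2.0 * (term - y + mu)
    sign = 1.0 if y >= mu else -1.0
    return sign * math.sqrt(max(d, 0.0))            # guard against tiny negative rounding

# With a small fitted mean, only a handful of residual values can occur.
for y in range(5):
    print(y, round(pearson_residual(y, 0.8), 3), round(deviance_residual(y, 0.8), 3))
```

The printed values show why residual plots for counts with small means form bands rather than a cloud: each possible count maps to a single residual value for a given fitted mean.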

As with linear model residuals (Sect. 5.11), Pearson or deviance residuals can
be plotted against covariates to suggest possible model forms. They may also
be plotted against fitted values or some function of the fitted values to assess
mean-variance relationships. If the spread is not constant, then this suggests that
the assumed mean-variance relationship is not correct. McCullagh and Nelder
(1989, pp. 398-399) recommend plotting against the fitted values transformed to the
"constant-information" scale. For example, for Poisson data, the suggestion is to
plot the residuals against $2\sqrt{\hat{\mu}}$. Residuals can also be examined for outliers/points
of high influence.

For the linear model, the diagonal elements of the hat matrix, $h = x(x^{\mathsf T}x)^{-1}x^{\mathsf T}$,
correspond to the leverage of response $i$, with $h_{ii} = 1$ if $y_i = x_i\hat{\beta}$ (Sect. 5.11.2).
Consideration of (6.15) reveals that for a GLM we may define a hat matrix as
$h = w^{1/2}x(x^{\mathsf T}wx)^{-1}x^{\mathsf T}w^{1/2}$, from which the diagonal elements may be extracted and,
once again, large values of $h_{ii}$ indicate that the fit is sensitive to $y_i$ in some way.
As with the linear model, responses with $h_{ii}$ close to 1 have high influence. Unlike
the linear case, $h$ depends on the response through $w$. Another useful standardized
version of residuals is

$$e_i^{\star} = \frac{Y_i - \hat{\mu}_i}{\sqrt{(1 - h_{ii})\widehat{\text{var}}(Y_i)}},$$

for $i = 1, \ldots, n$.
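
The leverages from the GLM hat matrix can be computed directly in small problems. The sketch below assumes a design with an intercept and a single covariate, so that the 2 x 2 matrix $x^{\mathsf T}wx$ can be inverted in closed form, and it takes the weights to be the fitted Poisson means under a log link; the covariate values, coefficients, and function name are hypothetical.

```python
import math

def glm_leverages(x, w):
    """Diagonal of h = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2} for the design X = [1, x]."""
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = s0 * s2 - s1 * s1
    # Entries of the inverse of X^T W X = [[s0, s1], [s1, s2]].
    a, b, c = s2 / det, -s1 / det, s0 / det
    return [wi * (a + 2.0 * b * xi + c * xi * xi) for wi, xi in zip(w, x)]

# Hypothetical covariates; weights are fitted means exp(0.1 + 0.2 x).
x = [0.0, 1.0, 2.0, 3.0, 10.0]
w = [math.exp(0.1 + 0.2 * xi) for xi in x]
h = glm_leverages(x, w)
print([round(v, 3) for v in h], round(sum(h), 6))
```

The leverages sum to the number of regression parameters (here 2), and the isolated point at x = 10 has leverage close to 1, so the fit is very sensitive to its response.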

It is approximately true that

$$V^{-1/2}(\hat{\mu} - \mu) \approx hV^{-1/2}(Y - \mu)$$

(McCullagh and Nelder 1989, p. 397), and so

$$V^{-1/2}(Y - \hat{\mu}) \approx (I - h)V^{-1/2}(Y - \mu)$$

which shows the effect of estimation of $\mu$ on properties of the residuals.


Example:  Pharmacokinetics  of  Theophylline 


We fit the gamma GLM $Y_i \mid \beta, \alpha \sim_{\text{ind}} \text{Ga}[\alpha^{-1}, (\alpha\mu_i)^{-1}]$ using MLE and
calculate Pearson residuals

$$e_i^{\star} = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\alpha}}\,\hat{\mu}_i}.$$

In Fig. 6.6(a), these residuals are plotted versus time $x_i$ and show no obvious
systematic pattern, though interpretation is difficult, given the small number of data
points and the spacing of these points over time. Figure 6.6(b) plots $|e_i^{\star}|$ against
fitted values to attempt to discover any unmodeled mean-variance relationship, and
again no strong signal is apparent.


Example:  Lung  Cancer  and  Radon 

As we have seen, fitting the quasi-likelihood model given by the mean and variance
specifications (6.28) and (6.29) yields $\hat{\alpha} = 2.76$, illustrating a large amount of
overdispersion. The quasi-MLE for $\beta_1$ is $-0.035$, with standard error 0.0088. We
compare with a negative binomial model having the same loglinear mean model and

$$\text{var}(Y_i) = \mu_i(1 + \mu_i/b). \qquad (6.37)$$

Previously, a negative binomial model was fitted to these data using a frequentist
approach in Sect. 2.5 and a Bayesian approach in Sect. 3.8. The negative binomial
MLE of $\beta_1$ is $-0.029$, with standard error 0.0082, illustrating that there is some sensitivity
to the model fitted.

For these data, the MLE is $\hat{b} = 61.3$ with standard error 17.3. Figure 6.7
shows the fitted quadratic relationship (6.37) for these data. We also plot the
quasi-likelihood fitted variance function. At first sight, it is surprising that the latter
is not steeper, but the jittered fitted values included at the top of the plot are
mostly concentrated on smaller values. The few larger values are very influential
in producing a small estimated value of $b$ (which corresponds to a large departure
from the linear mean-variance model).


Fig. 6.6 Pearson residual plots for the theophylline data: (a) residuals versus time for the GLM,
(b) absolute values of residuals versus fitted values for the GLM, (c) residuals versus time for the
nonlinear compartmental model, and (d) absolute values of residuals versus fitted values for the
nonlinear compartmental model

Fig. 6.7 The solid line shows the fitted negative binomial variance function,
$\widehat{\text{var}}(Y) = \hat{\mu}(1 + \hat{\mu}/\hat{b})$, plotted versus $\hat{\mu}$ for the lung cancer
and radon data. The dotted line corresponds to the fitted quasi-likelihood model,
$\widehat{\text{var}}(Y) = \hat{\alpha} \times \hat{\mu}$

Fig. 6.8 Absolute values of Poisson Pearson residuals versus $\sqrt{\hat{\mu}}$ when the true mean-variance
relationship is quadratic, but we analyze as if linear, for four simulated datasets with the same
expected numbers and covariate values as in the lung cancer and radon data

To attempt to determine which variance function is more appropriate, we
simulate data under the negative binomial model using $\{E_i, x_i, i = 1, \ldots, n\}$ and
$[\hat{\beta}, \hat{b}]$.

We then fit a Poisson model (which provides identical fitted values as from
a quasi-likelihood model), form residuals $(y - \hat{\mu})/\sqrt{\hat{\mu}}$, that is, residuals from a
Poisson model, and then plot the absolute value versus $\sqrt{\hat{\mu}}$ to see if we can detect
a trend. In the majority of simulations, the inadequacy of assuming the variance is
proportional to the mean is apparent; this endeavor is greatly helped by having just
a few points with very large fitted values. Specifically, the upward trend indicates
that the Poisson linear mean-variance assumption is not strong enough. Figure 6.8
shows four representative plots. Figure 6.9 gives the equivalent plot from the real
data. This plot shows a similar behavior to the simulated data, and so we tentatively
conclude that the quadratic mean-variance relationship is more appropriate for
these data. Cox (1983) provides further discussion of the effects on estimation
of different forms of overdispersion, including an extended discussion of
excess-Poisson variation.


Fig. 6.9 Absolute values of Poisson Pearson residuals versus $\sqrt{\hat{\mu}}$ for the lung cancer
and radon data


6.10 Nonlinear Regression Models

We now consider models of the form

$$Y_i = \mu_i(\beta) + \epsilon_i, \qquad (6.38)$$

for $i = 1, \ldots, n$, where $\mu_i(\beta) = \mu(x_i, \beta)$ is nonlinear in $x_i$, $\beta$ is assumed to
be of dimension $k + 1$, $E[\epsilon_i \mid \mu_i] = 0$, $\text{var}(\epsilon_i \mid \mu_i) = \sigma^2 f(\mu_i)$, and
$\text{cov}(\epsilon_i, \epsilon_j \mid \mu_i, \mu_j) = 0$. Such models are often used for positive responses, and if such data
are modeled on the original scale, it is common to find that the variance is of the
form $f(\mu) = \mu$ or $f(\mu) = \mu^2$. An alternative approach that is appropriate for the
latter case is to assume constant errors on the log-transformed response scale (see
Sect. 5.5.3). More generally, we might assume that $\text{var}(\epsilon_i \mid \beta, x_i) = \sigma^2 g_1(\beta, x_i)$,
with $\text{cov}(\epsilon_i, \epsilon_j \mid \beta, x_i, x_j) = \sigma^2 g_2(\beta, x_i, x_j)$. When data are measured over time,
serial correlation can be a particular problem. We concentrate on the simpler second
moment structure here.


Example:  Michaelis-Menten  Model 

A nonlinear form that is used to model the kinetics of many enzymes has mean

$$\mu(z) = \frac{\alpha_0 z}{\alpha_1 + z},$$

a nonlinear model. Parameter interpretation is obtained by recognizing that as $z \to \infty$,
$\mu(z) \to \alpha_0$ and at $z = \alpha_1$, $\mu(\alpha_1) = \alpha_0/2$. A possible model for such data is

$$Y(z) = \mu(z) + \epsilon(z),$$

with $E[\epsilon(z)] = 0$, $\text{var}[\epsilon(z)] = \sigma^2\mu(z)^r$, with $r = 0$, 1, or 2. An alternative approach
is to write

$$\frac{1}{\mu(x)} = \beta_0 + \beta_1 x,$$

where

$$x = 1/z, \qquad \beta_0 = 1/\alpha_0, \qquad \beta_1 = \alpha_1/\alpha_0,$$

which is a GLM with reciprocal link.
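
The equivalence of the two parameterizations is easy to confirm numerically. The sketch below evaluates the Michaelis-Menten mean and the reciprocal-link form at a few points; the parameter values and function names are hypothetical.

```python
def michaelis_menten(z, alpha0, alpha1):
    """Mean alpha0 * z / (alpha1 + z)."""
    return alpha0 * z / (alpha1 + z)

def reciprocal_form(z, alpha0, alpha1):
    """Same mean via 1 / mu = beta0 + beta1 * x with x = 1 / z."""
    beta0 = 1.0 / alpha0
    beta1 = alpha1 / alpha0
    x = 1.0 / z
    return 1.0 / (beta0 + beta1 * x)

# Hypothetical parameter values: asymptote 10, half-maximum at z = 2.
for z in (0.1, 1.0, 2.0, 5.0, 100.0):
    print(z, michaelis_menten(z, 10.0, 2.0), reciprocal_form(z, 10.0, 2.0))
```

At z = 2 both forms return half of the asymptote, matching the interpretation of the second parameter as the point at which the mean reaches half its maximum.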


6.11  Identifiability 

For many nonlinear models, identifiability is an issue, by which we mean that the
same curve may be obtained with different sets of parameter values. We have already
seen one example of this for the nonlinear model fitted to the theophylline data
(Sect. 6.2). As a second example, consider the sum-of-exponentials model

$$\mu(x, \beta) = \beta_0\exp(-x\beta_1) + \beta_2\exp(-x\beta_3), \qquad (6.39)$$

where $\beta = [\beta_0, \beta_1, \beta_2, \beta_3]$ and $\beta_j > 0$, $j = 0, 1, 2, 3$. The same curve results
under the parameter sets $[\beta_0, \beta_1, \beta_2, \beta_3]$ and $[\beta_2, \beta_3, \beta_0, \beta_1]$, and so we have
non-identifiability. In the previous "flip-flop" model (Sect. 6.2), identifiability could
be imposed through a substantive assumption such as $k_a > k_e > 0$, and for
model (6.39), we may enforce (say) $\beta_3 > \beta_1 > 0$ and work with the set

$$\gamma = [\log\beta_0, \log(\beta_3 - \beta_1), \log\beta_2, \log\beta_1]$$

which constrains $\beta_0 > 0$, $\beta_2 > 0$, and $\beta_3 > \beta_1 > 0$. If a Bayesian approach is
followed, a second possibility is to retain the original parameter set, but assign one
set of curves zero mass in the prior. The latter option is less appealing since it can
lead to a discontinuity in the prior.
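
The label-switching in the sum-of-exponentials model, and the way the log reparameterization removes it, can be illustrated with a short sketch. The parameter values and function names below are hypothetical.

```python
import math

def sum_of_exponentials(x, beta):
    """Mean beta0 * exp(-x * beta1) + beta2 * exp(-x * beta3)."""
    b0, b1, b2, b3 = beta
    return b0 * math.exp(-x * b1) + b2 * math.exp(-x * b3)

def from_gamma(gamma):
    """Map unconstrained gamma = [log b0, log(b3 - b1), log b2, log b1] back to beta."""
    g0, g1, g2, g3 = gamma
    b1 = math.exp(g3)
    return (math.exp(g0), b1, math.exp(g2), b1 + math.exp(g1))

beta = (2.0, 0.5, 1.0, 3.0)
swapped = (1.0, 3.0, 2.0, 0.5)   # the two exponential terms exchanged
for x in (0.0, 0.5, 2.0):
    print(x, sum_of_exponentials(x, beta), sum_of_exponentials(x, swapped))

# Any real-valued gamma yields beta with beta3 > beta1 > 0.
print(from_gamma((0.3, -1.0, 0.1, -0.7)))
```

Because every real vector gamma maps to a parameter set with the rates ordered, an optimizer or sampler working on gamma can never reach the swapped labeling.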


6.12 Likelihood Inference for Nonlinear Models

6.12.1 Estimation

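To obtain the likelihood function, a probability model for the data must be fully
specified. A common choice is

$$Y_i \mid \beta, \sigma \sim_{\text{ind}} N[\mu_i(\beta), \sigma^2\mu_i(\beta)^r],$$

for $i = 1, \ldots, n$, and with $r = 0$, 1, or 2 being common choices. The corresponding
log-likelihood function is

$$l(\beta, \sigma) = -n\log\sigma - \frac{r}{2}\sum_{i=1}^n \log\mu_i(\beta) - \frac{1}{2\sigma^2}\sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]^2}{\mu_i(\beta)^r}. \qquad (6.40)$$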

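Differentiation with respect to $\beta$ and $\sigma$ yields, with a little rearrangement, the score
equations

$$S_1(\beta, \sigma) = \frac{\partial l}{\partial\beta} = \frac{r}{2\sigma^2}\sum_{i=1}^n \frac{\partial\mu_i}{\partial\beta}\frac{1}{\mu_i(\beta)}\left\{\frac{[Y_i - \mu_i(\beta)]^2}{\mu_i(\beta)^r} - \sigma^2\right\} + \frac{1}{\sigma^2}\sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]}{\mu_i(\beta)^r}\frac{\partial\mu_i}{\partial\beta} \qquad (6.41)$$

$$S_2(\beta, \sigma) = \frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]^2}{\mu_i(\beta)^r}.$$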

Notice that this pair of quadratic estimating functions (Sect. 2.8) are such that
$E[S_1] = 0$ and $E[S_2] = 0$ if the first two moments are correctly specified, in
which case consistency of $\hat{\beta}$ results. It is important to emphasize that if $r > 0$, we
require the second moment to be correctly specified in order to produce a consistent
estimator of $\beta$. If $r = 0$, the first term of (6.41) disappears, and we require the first
moment only for consistency. In general, the MLEs $\hat{\beta}$ are not available in closed
form, but numerical solutions are usually straightforward (e.g., via Gauss-Newton
methods or variants thereof) and are available in most statistical software. The MLE
for $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \frac{[Y_i - \mu_i(\hat{\beta})]^2}{\mu_i(\hat{\beta})^r} \qquad (6.42)$$

but, by analogy with the linear model case, it is more usual to use the degrees of
freedom adjusted estimator

$$\hat{\sigma}^2 = \frac{1}{n - k - 1}\sum_{i=1}^n \frac{[Y_i - \mu_i(\hat{\beta})]^2}{\mu_i(\hat{\beta})^r}. \qquad (6.43)$$

For a nonlinear model, the estimator (6.43) has finite sample bias but is often preferred to (6.42)
because of better small sample performance.
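
A Gauss-Newton iteration and the two variance estimators can be sketched for a small example. The code below is an illustration under stated assumptions: it takes the case r = 0 (constant variance), uses the two-parameter Michaelis-Menten mean so that the normal equations are 2 x 2 and can be solved in closed form, applies no step-size control, and runs on hypothetical noise-free data. The function names are also hypothetical.

```python
def gauss_newton_mm(z, y, alpha0, alpha1, n_iter=50):
    """Gauss-Newton for mu(z) = alpha0 * z / (alpha1 + z) with constant variance."""
    for _ in range(n_iter):
        # Residuals and the two Jacobian columns d mu / d alpha0, d mu / d alpha1.
        r = [yi - alpha0 * zi / (alpha1 + zi) for zi, yi in zip(z, y)]
        d0 = [zi / (alpha1 + zi) for zi in z]
        d1 = [-alpha0 * zi / (alpha1 + zi) ** 2 for zi in z]
        # Normal equations (D^T D) step = D^T r, solved for the 2 x 2 case.
        a = sum(u * u for u in d0)
        b = sum(u * v for u, v in zip(d0, d1))
        c = sum(v * v for v in d1)
        g0 = sum(u * ri for u, ri in zip(d0, r))
        g1 = sum(v * ri for v, ri in zip(d1, r))
        det = a * c - b * b
        alpha0 += (c * g0 - b * g1) / det
        alpha1 += (a * g1 - b * g0) / det
    return alpha0, alpha1

def sigma2_estimates(z, y, alpha0, alpha1, n_params=2):
    """Return the divisor-n estimate (6.42) and the adjusted estimate (6.43) for r = 0."""
    rss = sum((yi - alpha0 * zi / (alpha1 + zi)) ** 2 for zi, yi in zip(z, y))
    n = len(y)
    return rss / n, rss / (n - n_params)

# Hypothetical noise-free data generated with alpha0 = 10 and alpha1 = 2.
z = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
y_exact = [10.0 * zi / (2.0 + zi) for zi in z]
print(gauss_newton_mm(z, y_exact, 8.0, 1.5))
```

In practice a damped or trust-region variant is usually preferred, since the plain iteration can diverge from poor starting values.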

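Under the usual regularity conditions,

$$I(\theta)^{1/2}(\hat{\theta}_n - \theta) \to_d N_{k+2}(0, I_{k+2}),$$

where $\theta = [\beta, \sigma]$ and $I(\theta)$ is Fisher's expected information. In the case of $r = 0$,
we obtain

$$l(\beta, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n [Y_i - \mu_i(\beta)]^2$$

$$S_1(\beta, \sigma) = \frac{1}{\sigma^2}\sum_{i=1}^n [Y_i - \mu_i(\beta)]\frac{\partial\mu_i}{\partial\beta} \qquad (6.44)$$

$$S_2(\beta, \sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n [Y_i - \mu_i(\beta)]^2$$

$$I_{11} = -E\left[\frac{\partial S_1}{\partial\beta}\right] = \frac{1}{\sigma^2}\sum_{i=1}^n \left(\frac{\partial\mu_i}{\partial\beta}\right)\left(\frac{\partial\mu_i}{\partial\beta}\right)^{\mathsf T}$$

$$I_{12} = -E\left[\frac{\partial S_1}{\partial\sigma}\right] = 0$$

$$I_{21} = -E\left[\frac{\partial S_2}{\partial\beta}\right] = 0^{\mathsf T}$$

$$I_{22} = -E\left[\frac{\partial S_2}{\partial\sigma}\right] = \frac{2n}{\sigma^2}.$$

Asymptotically,

$$\frac{\sum_{i=1}^n [Y_i - \mu_i(\hat{\beta})]^2}{\sigma^2} \sim \chi^2_{n-k-1}, \qquad (6.45)$$

which may be used to construct approximate F tests, as described in Sect. 6.12.2.
If $r$ is unknown, then it may also be estimated by deriving the score from the
likelihood (6.40), though an abundance of data will be required. Estimation of
the power in a related variance model is carried out in the example at the end of
Sect. 9.20.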


Example: Pharmacokinetics of Theophylline

We let $y_i$ represent the log concentration and assume the model $y_i \mid \beta, \sigma^2 \sim_{\text{ind}}
N[\mu_i(\beta), \sigma^2]$, $i = 1, \ldots, n$, where

$$\mu_i(\beta) = \log\left\{\frac{Dk_a}{V(k_a - k_e)}[\exp(-k_e x_i) - \exp(-k_a x_i)]\right\} \qquad (6.46)$$

with $\beta = [\beta_0, \beta_1, \beta_2]$ and $\beta_0 = V$, $\beta_1 = k_a$, $\beta_2 = k_e$. We fit this model using
maximum likelihood estimation for $\beta$ and the moment estimator (6.43) for $\sigma^2$.
The results are displayed in Table 6.2, with the fitted curve displayed on Fig. 6.1.
Confidence intervals, based on the asymptotic distribution of the MLE, were
calculated for the parameters of interest using the delta method. These parameters
are all positive, and so the intervals were obtained on the log-transformed scale and
then exponentiated.

In Fig. 6.10, slices through the three-dimensional likelihood surface are
displayed. The two-dimensional surfaces are evaluated at the MLE of the third variable.
A computationally expensive alternative would be to profile with respect to the third
parameter, as described in Sect. 2.4.2. In the left column the range of each variable
is taken as the MLE ± 3 asymptotic standard errors, and the surfaces are very well
behaved. By contrast, in the right column of the figure, the range is ±30 standard
errors, and here we see very irregular shapes, with some of the contours remaining
open. Such shapes are typical when nonlinear models are fitted and are not in general
only apparent at points far from the maximum of the likelihood.


6.12.2  Hypothesis  Testing 


As usual, hypothesis tests may be carried out using Wald, score, or likelihood ratio
statistics, and again we concentrate on the latter. Suppose that $\dim(\beta) = k + 1$ and
let $\beta = [\beta_1, \beta_2]$ be a partition with $\beta_1 = [\beta_0, \ldots, \beta_q]$ and $\beta_2 = [\beta_{q+1}, \ldots, \beta_k]$,
with $0 \le q < k$. Interest focuses on testing whether a subset of $k - q$ parameters
are equal to zero via a test of the null

$$H_0: \beta_1 \text{ unrestricted}, \ \beta_2 = \beta_{20} \quad \text{versus} \quad H_1: \beta = [\beta_1, \beta_2] \neq [\beta_1, \beta_{20}].$$

Asymptotically, and with known $\sigma$,

$$2\left\{l(\hat{\beta}^{(1)}, \sigma^2) - l(\hat{\beta}^{(0)}, \sigma^2)\right\} \to_d \chi^2_{k-q},$$

where $\hat{\beta}^{(0)}$ and $\hat{\beta}^{(1)}$ are the MLEs under null and alternative, respectively, and
$l(\beta, \sigma^2)$ is given by (6.40). Unlike the normal linear model, this result is only
asymptotically valid for a normal nonlinear model. For the usual case of unknown
$\sigma^2$, one may substitute an estimate or use an F test with degrees of freedom $k - q$
and $n - k - 1$, though the numerator and denominator sums of squares are only
asymptotically independent. The denominator sum of squares is given in (6.45).
More cautiously, one may assess the significance using Monte Carlo simulation
under the null.

Fig. 6.10 Likelihood contours for the theophylline data with the range of each parameter being
the MLE ± 3 standard errors in the left column and ± 30 standard errors in the right column; (a)
and (b) $\log k_a$ versus $\log V$, (c) and (d) $\log k_e$ versus $\log V$, and (e) and (f) $\log k_e$ versus $\log k_a$.
On each plot, the filled circle represents the MLE. In each panel, the third variable is held at its
MLE


6.13 Least Squares Inference

We first consider model (6.38) with $E[\epsilon_i \mid \mu_i] = 0$, $\text{var}(\epsilon_i \mid \mu_i) = \sigma^2$, and
$\text{cov}(\epsilon_i, \epsilon_j \mid \mu_i, \mu_j) = 0$. In this case we may obtain ordinary least squares estimates,
$\hat{\beta}$, that minimize the sum of squares

$$\sum_{i=1}^n [Y_i - \mu_i(\beta)]^2 = [Y - \mu(\beta)]^{\mathsf T}[Y - \mu(\beta)].$$

Differentiation with respect to $\beta$, and letting $D$ be the $n \times (k + 1)$ dimensional
matrix with element $\partial\mu_i/\partial\beta_j$, yields the estimating function

$$\sum_{i=1}^n [Y_i - \mu_i(\beta)]\frac{\partial\mu_i}{\partial\beta} = D^{\mathsf T}(Y - \mu),$$

which is identical to (6.44) and is optimal within the class of linear estimating
functions, under correct specification of the first two moments.

If we now assume uncorrelated errors with $\text{var}(\epsilon_i \mid \mu_i) = \sigma^2\mu_i^r(\beta)$, then the
method of generalized least squares estimates $\beta$ by temporarily forgetting that
the variance depends on $\beta$. This is entirely analogous to the motivation for
quasi-likelihood; see the discussion centered around (2.28) in Sect. 2.5.1. We therefore
minimize

$$\sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]^2}{\mu_i^r(\beta)} = [Y - \mu(\beta)]^{\mathsf T}V^{-1}[Y - \mu(\beta)],$$

where $V$ is the $n \times n$ diagonal matrix with diagonal elements $\mu_i^r(\beta)$, $i = 1, \ldots, n$.
The estimating function is

$$\sum_{i=1}^n \frac{[Y_i - \mu_i(\beta)]}{\mu_i^r(\beta)}\frac{\partial\mu_i}{\partial\beta} = D^{\mathsf T}V^{-1}(Y - \mu),$$

which is identical to that under quasi-likelihood (6.10). Inference may be based on
the asymptotic result

$$(D^{\mathsf T}V^{-1}D/\sigma^2)^{1/2}(\hat{\beta}_n - \beta) \to_d N_{k+1}(0, I_{k+1}). \qquad (6.47)$$

If the normal model is true, then the GLS estimator is not as efficient as that
obtained from a likelihood approach but is more reliable under model misspecification.
Therefore, the approach that is followed should depend on how much faith we
have in the assumed model.

In Sect. 9.10, we will discuss further the trade-offs encountered when one wishes
to exploit the additional information concerning $\beta$ contained within the variance
function.


6.14  Sandwich  Estimation  for  Nonlinear  Models 

The sandwich estimator of the variance is again available and takes exactly the same
form as with the GLM. In particular, consider the estimating function

$$G(\beta) = D^{\mathsf T}V^{-1}(Y - \mu),$$

with $D$ an $n \times (k + 1)$ matrix with elements $\partial\mu_i/\partial\beta_j$, $i = 1, \ldots, n$, $j = 0, \ldots, k$,
and $V$ the diagonal matrix with elements $V_{ii} = \mu_i(\beta)^r$ with $r \ge 0$ known.
This estimating equation arises from likelihood considerations if $r = 0$ or, more
generally, from GLS. With this form for $G(\cdot)$, (6.30), (6.31), and (6.32) all hold.


Example:  Pharmacokinetics  of  Theophylline 

We now let $y_i$ be the concentration and consider the model with first two moments

$$E[Y_i \mid \beta, \sigma^2] = \mu_i(\beta) = \frac{Dk_a}{V(k_a - k_e)}[\exp(-k_e x_i) - \exp(-k_a x_i)],$$

$$\text{var}(Y_i \mid \beta, \sigma^2) = \sigma^2\mu_i(\beta)^2,$$

for $i = 1, \ldots, n$. One possibility for fitting is generalized least squares. As an
alternative, we may assume $Y_i \mid \beta, \sigma^2 \sim_{\text{ind}} N[\mu_i(\beta), \sigma^2\mu_i(\beta)^2]$, $i = 1, \ldots, n$,
and proceed with maximum likelihood estimation. Table 6.3 gives estimates of the
above model under GLS and MLE, along with likelihood estimation for the model,

$$\log Y_i \mid \beta, \tau^2 \sim_{\text{ind}} N\{\log[\mu_i(\beta)], \tau^2\}.$$

There are some differences in the table, but overall the estimates and standard errors
are in reasonable agreement. Table 6.2 gives confidence intervals for $x_{1/2}$, $x_{\max}$,
and $\mu(x_{\max})$ based on sandwich estimation. As with the GLM analysis, the interval
estimates are a little shorter.

Table 6.3 Point estimates and asymptotic standard errors for the theophylline
data, under various models and estimation techniques. In all cases the coefficient
of variation is approximately constant

Model                 log V            log k_a          log k_e
MLE log scale         -0.78 (0.035)    0.79 (0.089)     -2.39 (0.037)
GLS original scale    -0.77 (0.030)    0.81 (0.055)     -2.39 (0.032)
MLE original scale    -0.74 (0.025)    0.85 (0.069)     -2.45 (0.044)

6.15  The  Geometry  of  Least  Squares 

In  this  section  we  briefly  discuss  the  geometry  of  least  squares  to  gain  insight  into 
the  fundamental  differences  between  linear  and  nonlinear  fitting. 

We consider minimization of

$$(y - \mu)^T (y - \mu), \tag{6.48}$$

where $y$ and $\mu$ are $n \times 1$ vectors. We first examine the linear model, $\mu = x\beta$, where
$x$ is $n \times (k + 1)$ and $\beta$ is $(k + 1) \times 1$. For fixed $x$, the so-called solution locus maps
out the fitted values $x\beta$ for all values of $\beta$ and is a $(k + 1)$-dimensional hyperplane
of infinite extent. Differentiation of (6.48) gives

$$x^T (y - x\hat\beta) = x^T e = 0,$$

where $\hat\beta = (x^T x)^{-1} x^T y$ and $e$ is the $n \times 1$ vector of residuals. So the sum of
squares is minimized when the vector $(y - x\hat\beta)$ is orthogonal to the hyperplane that
constitutes the solution locus. The fitted values are

$$\hat{y} = x\hat\beta = x (x^T x)^{-1} x^T y = h y,$$

and are the orthogonal projection of $y$ onto the plane spanned by the columns of $x$,
with $h$ the matrix that represents this projection.
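
The projection interpretation can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up design with an intercept and one covariate (the data values and the helper functions `transpose`, `matmul`, and `inverse_2x2` are illustrative and not from the text). It forms the projection matrix, the fitted values, and the residuals, and confirms that the residual vector is orthogonal to the columns of the design matrix.

```python
# Least squares as an orthogonal projection, checked on hypothetical data.

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(r * c for r, c in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical design matrix (intercept and one covariate, so k = 1) with n = 4.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [[0.1], [0.9], [2.2], [2.8]]

xt = transpose(x)
xtx_inv = inverse_2x2(matmul(xt, x))
beta_hat = matmul(matmul(xtx_inv, xt), y)    # (x^T x)^{-1} x^T y
h = matmul(matmul(x, xtx_inv), xt)           # projection matrix x (x^T x)^{-1} x^T
y_hat = matmul(h, y)                         # fitted values h y
e = [[yi[0] - fi[0]] for yi, fi in zip(y, y_hat)]   # residual vector

xte = matmul(xt, e)    # normal equations: x^T e should be numerically zero
hh = matmul(h, h)      # a projection matrix is idempotent: h h = h

print("beta_hat:", [round(b[0], 4) for b in beta_hat])
print("x^T e:", [round(v[0], 12) for v in xte])
```

The idempotence check `h h = h` is included because it is the algebraic signature of a projection: projecting the fitted values a second time leaves them unchanged.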

For a nonlinear model, the solution locus is a curved $(k + 1)$-dimensional surface,
possibly with finite extent. In contrast to the linear model, equally spaced points on
lines in the parameter space do not map to equally spaced points on the solution
locus but rather to unequally spaced points on curves.

These observations have several implications. In terms of inference, recall from
Sect. 5.6.1, in particular equation (5.27) with $q = -1$, that for a linear model, a
$100(1 - \alpha)\%$ confidence interval for $\beta$ is the ellipsoid

$$(\beta - \hat\beta)^T x^T x (\beta - \hat\beta) \leq (k + 1) s^2 F_{k+1,\,n-k-1}(1 - \alpha).$$

Geometrically, the region has this form because the solution locus is a plane and the
residual vector is orthogonal to the plane so that values of $\beta$ map onto a disk. For
nonlinear models, asymptotic inference for $\beta$ results from

$$(\beta - \hat\beta)^T V^{-1} (\beta - \hat\beta) \leq (k + 1) s^2 F_{k+1,\,n-k-1}(1 - \alpha),$$

where $\mathrm{var}(\hat\beta) = \sigma^2 V$, with $\hat\sigma^2 = s^2$. The approximation occurs because the
solution  locus  is  curved,  and  equi-spaced  points  in  the  parameter  space  map  to 
unequally  spaced  points  on  curved  lines  on  the  solution  locus.  Intuitively,  inference 
will  be  more  accurate  if  the  relevant  part  of  the  solution  locus  is  flat  and  if  parallel 
equi-spaced  lines  in  the  parameter  space  map  to  parallel  equi-spaced  lines  on  the 
solution locus. The curvature and lack of equally spaced points manifest themselves
in contours of equal likelihood being banana-shaped and perhaps “open” (so that
they do not join). The right column of Fig. 6.10 gives examples of this behavior.
Another  important  aspect  is  that  reparameterization  of  the  model  can  alter  the 
behavior  of  points  mapped  onto  the  solution  locus,  but  cannot  affect  the  curvature 
of  the  locus.  Hence,  the  curvature  of  the  solution  locus  has  been  referred  to  as 
the  intrinsic  curvature  (Beale  1960;  Bates  and  Watts  1980),  while  the  aspect  that 
is  parameterization  dependent  is  the  parameter-effects  curvature  (Bates  and  Watts 
1980).  We  note  that  the  solution  locus  does  not  depend  on  the  observed  data  but 
only on the model and design. As $n \to \infty$, the surface becomes increasingly locally
linear and inference correspondingly more accurate.

We illustrate with a simple fictitious example with $n = 2$, $x = [1, 2]$, and $y =
[0.2, 0.7]$. We compare two models, each with a single parameter, the linear zero
intercept model

$$\mu = x\beta, \quad -\infty < \beta < \infty,$$

and the (simplified) nonlinear Michaelis-Menten model

$$\mu = x/(x + \theta), \quad \theta > 0.$$

Figure 6.11(a) plots the data versus the two fitted curves (obtained via least squares),
while panel (b) plots the solution locus for the linear model, which in this case is a
line (since $k = 0$). The point $[x_1\hat\beta, x_2\hat\beta]$ with least squares estimate

$$\hat\beta = \sum_{i=1}^{2} x_i y_i \Big/ \sum_{i=1}^{2} x_i^2 = 0.32,$$

is the fitted point and is indicated as a solid circle. The dashed line is the vector
joining $[y_1, y_2]$ to the fitted point and is perpendicular to the solution locus.
The circles indicated on the solution locus correspond to changes in $\beta$ of 0.1 and
are equi-spaced on the locus. The final aspect to note is that the locus is of infinite
extent.
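
The numbers in this linear example are easy to reproduce. The sketch below, in plain Python, uses only the fictitious data $x = [1, 2]$ and $y = [0.2, 0.7]$ given above; the grid of parameter values stepping by 0.1 is chosen to mirror the circles described for panel (b), and the variable names are ours.

```python
import math

x = [1.0, 2.0]
y = [0.2, 0.7]

# Least squares estimate for the zero intercept model: sum(x*y) / sum(x^2).
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
fitted = [xi * beta_hat for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# The locus is the line through the origin with direction x, so orthogonality
# of the residual to the locus means a zero inner product with x.
inner = sum(xi * ri for xi, ri in zip(x, resid))

# Equal steps of 0.1 in the parameter give equal steps along the locus.
grid = [0.1 * j for j in range(6)]
locus = [(x[0] * b, x[1] * b) for b in grid]
gaps = [math.dist(p, q) for p, q in zip(locus, locus[1:])]

print("beta_hat =", round(beta_hat, 2))
print("gaps along the locus:", [round(g, 4) for g in gaps])
```

Each gap equals $0.1\sqrt{1^2 + 2^2}$, which is why the circles are equi-spaced for the linear model whatever the starting value.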

Fig. 6.11 (a) Fictitious data with x = [1, 2] and y = [0.2, 0.7], and fitted lines, (b) solution locus
for the zero intercept linear model with the observed data indicated as a cross and the fitted value as
a filled circle, (c) solution locus for the Michaelis-Menten model with the observed data indicated
as a cross and the fitted value as a filled circle, and (d) solution locus for the Michaelis-Menten
model under a second parameterization with the observed data indicated as a cross and the fitted
value as a filled circle

Panel (c) of Fig. 6.11 plots the solution locus for the Michaelis-Menten model,
for which $\hat\theta = 1.70$. The vector joining $[y_1, y_2]$ to the fitted values
$[x_1/(x_1 + \hat\theta), x_2/(x_2 + \hat\theta)]$ is perpendicular to the curved solution locus, but we see
that points on the latter are not equally spaced. Also, the solution locus is of finite extent
moving from the point $[0, 0]$ for $\theta = \infty$ to the point $[1, 1]$ for $\theta = 0$ (these are the
asymptotes of the model). Finally, panel (d) reproduces panel (c) with the Michaelis-Menten
model reparameterized as

$$\left[\frac{x_1}{x_1 + \exp(\phi)}, \frac{x_2}{x_2 + \exp(\phi)}\right], \quad \text{with } \phi = \log\theta.$$

The spacing of points on the solution locus is quite different under the new
parameterization. The points are more equally spaced close to the fitted value,
indicating that asymptotic standard errors are more likely to be accurate under this
parameterization.
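
A short numerical sketch can make the effect of the reparameterization concrete. It uses the fictitious data from this example and plain Python. The golden-section search, the grid steps (0.5 in $\theta$ and 0.3 in $\phi$), and the summary `gap_ratio` are our own illustrative choices rather than anything prescribed in the text. The ratio of the largest to the smallest gap between neighboring points on the locus equals 1 when the points are equally spaced.

```python
import math

x = [1.0, 2.0]
y = [0.2, 0.7]

def locus_point(theta):
    """Point on the Michaelis-Menten solution locus for a given theta > 0."""
    return tuple(xi / (xi + theta) for xi in x)

def rss(theta):
    return sum((yi - mi) ** 2 for yi, mi in zip(y, locus_point(theta)))

def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimize a function with a single interior minimum on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

theta_hat = golden_section_min(rss, 0.01, 20.0)

# Perpendicularity: residual vector against the tangent d mu / d theta.
resid = [yi - mi for yi, mi in zip(y, locus_point(theta_hat))]
tangent = [-xi / (xi + theta_hat) ** 2 for xi in x]
inner = sum(r * t for r, t in zip(resid, tangent))

def gap_ratio(points):
    """Largest gap divided by smallest gap between consecutive points."""
    gaps = [math.dist(p, q) for p, q in zip(points, points[1:])]
    return max(gaps) / min(gaps)

theta_grid = [theta_hat + 0.5 * j for j in range(-2, 3)]      # equal steps in theta
phi_hat = math.log(theta_hat)
phi_grid = [phi_hat + 0.3 * j for j in range(-2, 3)]          # equal steps in phi
ratio_theta = gap_ratio([locus_point(t) for t in theta_grid])
ratio_phi = gap_ratio([locus_point(math.exp(p)) for p in phi_grid])

print("theta_hat =", round(theta_hat, 2))
print("gap ratio under theta:", round(ratio_theta, 2))
print("gap ratio under phi = log(theta):", round(ratio_phi, 2))
```

The locus itself is identical under both parameterizations; only the spacing of the mapped points changes, which is the distinction between intrinsic and parameter-effects curvature.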
6.16 Bayesian Inference for Nonlinear Models

Bayesian inference for nonlinear models is based on the posterior distribution

$$p(\beta, \sigma^2 \mid y) \propto l(\beta)\,\pi(\beta, \sigma^2).$$

We discuss in turn prior specification, computation, and hypothesis testing.


6.16.1  Prior  Specification 


We begin by assuming independent priors on $\beta$ and $\sigma^2$:

$$\pi(\beta, \sigma^2) = \pi(\beta)\,\pi(\sigma^2).$$

The prior on $\sigma^2$ is a less critical choice, and $\sigma^{-2} \sim \mathrm{Ga}(a, b)$ is an obvious candidate.
The choice $a = b = 0$, which gives the improper prior $\pi(\sigma^2) \propto 1/\sigma^2$, will often
be a reasonable option. If the variance model is of the form $\mathrm{var}(Y_i) = \sigma^2 \mu_i(\beta)^r$,
then clearly substantive prior beliefs will depend on $r$ so that we must specify the
conditional form $\pi(\sigma^2 \mid r)$, since the scale of $\sigma^2$ depends on the choice for $r$.

So far as a prior for $\beta$ is concerned, great care must be taken to ensure that the
resultant posterior is proper; Sect. 3.4 provided an example of the problems that can
arise with a nonlinear model. In general, models must be considered on a case-by-case
basis. However, a parameter, $\theta$ (say), corresponding to an asymptote (so that
$\mu \to a$ as $\theta \to \infty$), will generally require a proper prior because the likelihood tends
to the constant

$$\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - a)^2\right]$$

as $\theta \to \infty$ and not zero as is necessary to ensure propriety.
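
To see the flattening of the likelihood numerically, the sketch below reuses the simplified Michaelis-Menten mean $x/(x + \theta)$ and the fictitious data $x = [1, 2]$, $y = [0.2, 0.7]$ from Sect. 6.15, for which the asymptote as $\theta \to \infty$ is $a = 0$. It assumes normal errors with a known standard deviation; the value `sigma = 0.5` is hypothetical and chosen only for illustration.

```python
import math

x = [1.0, 2.0]
y = [0.2, 0.7]
sigma = 0.5  # hypothetical known standard deviation, for illustration only

def likelihood_kernel(theta):
    """exp(-RSS(theta) / (2 sigma^2)) for the mean function x / (x + theta)."""
    rss = sum((yi - xi / (xi + theta)) ** 2 for xi, yi in zip(x, y))
    return math.exp(-rss / (2.0 * sigma ** 2))

# Limiting value as theta -> infinity, where every mean tends to a = 0.
limit = math.exp(-sum(yi ** 2 for yi in y) / (2.0 * sigma ** 2))

for theta in (1.0, 1e2, 1e4, 1e8):
    print(theta, round(likelihood_kernel(theta), 6), round(limit, 6))
```

Because the kernel levels off at a strictly positive value, its integral against a flat prior on $(0, \infty)$ diverges, which is why a proper prior is needed for such a parameter.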


6.16.2  Computation 

Unfortunately, closed-form posterior distributions do not exist with a nonlinear
mean function, but sampling-based methods are again relatively straightforward
to implement. A pure Gibbs sampling strategy (Sect. 3.8.4) is not so appealing
since the conditional distribution, $\beta \mid y, \sigma$, will not have a familiar form.
However, Metropolis-Hastings algorithms (Sect. 3.8.2) will be easy to construct. If an
informative prior is present, direct sampling via a rejection algorithm, with the prior
as a proposal, may present a viable option.
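
As a rough illustration of rejection sampling with the prior as proposal, the sketch below again uses the simplified Michaelis-Menten model and fictitious data of Sect. 6.15. Several ingredients are assumptions made only so that the example runs: the known standard deviation `sigma = 0.2`, the lognormal prior with parameters 0.5 and 1.0, and the function name `rejection_sample`. A prior draw is accepted with probability equal to the likelihood divided by an upper bound on the likelihood, and here the bound 1 is valid for the kernel $\exp[-\mathrm{RSS}/(2\sigma^2)]$.

```python
import math
import random

x = [1.0, 2.0]
y = [0.2, 0.7]
sigma = 0.2  # hypothetical known standard deviation

def log_lik(theta):
    """Normal log-likelihood kernel, -RSS / (2 sigma^2), which is always <= 0."""
    rss = sum((yi - xi / (xi + theta)) ** 2 for xi, yi in zip(x, y))
    return -rss / (2.0 * sigma ** 2)

def rejection_sample(n_draws, rng, log_bound=0.0):
    """Sample from the posterior by proposing from the prior.

    A proposal theta is accepted with probability L(theta) / M, where
    log M = log_bound is an upper bound on the log-likelihood kernel.
    """
    accepted, proposed = [], 0
    while len(accepted) < n_draws:
        theta = rng.lognormvariate(0.5, 1.0)   # hypothetical lognormal prior
        proposed += 1
        u = 1.0 - rng.random()                 # uniform on (0, 1], avoids log(0)
        if math.log(u) < log_lik(theta) - log_bound:
            accepted.append(theta)
    return accepted, proposed

samples, n_proposed = rejection_sample(500, random.Random(1))
print("acceptance rate:", round(len(samples) / n_proposed, 3))
print("posterior median:", round(sorted(samples)[len(samples) // 2], 2))
```

The acceptance rate estimates the ratio of the normalizing constant to the bound, so a sharper bound (for example the likelihood evaluated at its maximum) would accept more often at the cost of an optimization step.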
6.16.3 Hypothesis Testing

As  with  GLMs  (Sect.  6.8.3),  posterior  tail  areas  and  Bayes  factors  are  available  to 
test  hypotheses/compare  models. 


Example:  Pharmacokinetics  of  Theophylline 

We report a Bayesian analysis of the theophylline data and specify lognormal priors
for $x_{1/2}$, $x_{\max}$, and $\mu(x_{\max})$ using the same specification as with the GLM analysis.
Samples from the posterior for $[V, k_a, k_e]$ are obtained from the rejection algorithm.
Specifically, we sample from the prior on the parameters of interest and then back-solve
for the parameters that describe the likelihood. For the compartmental model,
we transform back to the original parameters via

$$k_e = (\log 2)/x_{1/2}$$

(6.49)

so that $k_a$ is not directly available but must be obtained as the root of (6.49).
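
The back-solving step can be sketched as follows. For the compartmental mean function, setting its derivative with respect to time to zero gives $x_{\max} = \log(k_a/k_e)/(k_a - k_e)$, and we assume here that this is the relation whose root is required. The bisection routine, its name `solve_ka`, the search limit `upper`, and the numerical values used in the check are all illustrative assumptions. The search is restricted to $k_a > k_e$, where the right-hand side decreases from $1/k_e$ towards zero as $k_a$ grows, so the root is unique when it exists.

```python
import math

def time_to_max(ka, ke):
    """Time of maximum concentration: log(ka / ke) / (ka - ke)."""
    return math.log(ka / ke) / (ka - ke)

def solve_ka(x_half, x_max, upper=1e4, tol=1e-12):
    """Recover (ke, ka) from the half-life and time to maximum by bisection."""
    ke = math.log(2.0) / x_half
    if x_max >= 1.0 / ke:
        raise ValueError("no root with ka > ke: need x_max < 1 / ke")
    if time_to_max(upper, ke) > x_max:
        raise ValueError("increase 'upper': root lies beyond the search range")
    lo, hi = ke * (1.0 + 1e-9), upper
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if time_to_max(mid, ke) > x_max:
            lo = mid    # x_max too large at mid, so the root has larger ka
        else:
            hi = mid
    return ke, 0.5 * (lo + hi)

# Round-trip check with hypothetical rate constants.
ke_true, ka_true = 0.09, 2.2
x_half = math.log(2.0) / ke_true
x_max = time_to_max(ka_true, ke_true)
ke_est, ka_est = solve_ka(x_half, x_max)
print(round(ke_est, 4), round(ka_est, 4))
```

With $k_e$ and $k_a$ in hand, the volume $V$ would follow from the maximum concentration by rearranging the mean function evaluated at $x_{\max}$.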

Table  6.2  summarizes  inference  for  the  parameters  of  interest  with  the  interval 
estimates  and  medians  being  obtained  as  the  sample  quantiles.  Figure  6.12  shows 
the  posteriors  for  functions  of  interest  under  the  nonlinear  model.  The  posteriors  are 
skewed for all functions of interest. These figures and Table 6.2 show that Bayesian
inference for the GLM and nonlinear model is very similar. Frequentist and
Bayesian methods are also in close agreement for these data, which is reassuring.

Recall that the parameter sets $[V, k_a, k_e]$ and $[V k_e/k_a, k_e, k_a]$ produce identical
curves for the compartmental model (6.1). One solution to this identifiability
problem is to enforce $k_a > k_e > 0$, for example, by parameterizing in terms of
$\log k_e$ and $\log(k_a - k_e)$. Pragmatically, not resorting to this parameterization is
reasonable, so long as $k_a$ and $k_e$ are not close. Figure 6.13 shows the bivariate
posterior distribution $p(k_a, k_e \mid y)$, and we see that $k_a \gg k_e$ for these data, and so
there is no need to address the identifiability issue.
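
The equivalence of the two parameter sets can be verified directly. The sketch below evaluates the compartmental mean function, in the standard one-compartment form with first-order absorption and elimination, under both sets. The dose, parameter values, and time points are hypothetical and chosen only for illustration.

```python
import math

def mean_conc(x, dose, V, ka, ke):
    """Compartmental mean: D ka / (V (ka - ke)) * [exp(-ke x) - exp(-ka x)]."""
    return dose * ka / (V * (ka - ke)) * (math.exp(-ke * x) - math.exp(-ka * x))

dose = 4.0                      # hypothetical dose
V, ka, ke = 0.46, 2.2, 0.09     # hypothetical parameter values
times = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0]

original = [mean_conc(t, dose, V, ka, ke) for t in times]
# Swap the roles of ka and ke and rescale the volume to V * ke / ka.
flipped = [mean_conc(t, dose, V * ke / ka, ke, ka) for t in times]

for t, a, b in zip(times, original, flipped):
    print(t, round(a, 6), round(b, 6))
```

Since the two curves agree at every time point, the data alone cannot distinguish the two sets, and a constraint such as $k_a > k_e$ or prior information is needed to pick one.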

Fig. 6.12 Histogram representations of posterior distributions from the nonlinear compartmental
model for the theophylline data for the (a) half-life, (b) time to maximum, (c) maximum
concentration, and (d) coefficient of variation, with priors superimposed as solid lines

Fig. 6.13 Image plot of samples from the joint posterior distribution of the absorption and
elimination rate constants, $k_a$ and $k_e$, for the theophylline data

Fig. 6.14 Histogram representations of posterior distributions from the nonlinear compartmental
models for the reduced theophylline dataset of n = 3 points for the (a) half-life, (b) time to
maximum, (c) maximum concentration, and (d) coefficient of variation, with priors superimposed
as solid lines

Another benefit of specifying the prior in terms of model-free parameters is that
models may be compared using Bayes factors on an “even playing field,” in the sense
that the prior input for each model is identical. For more discussion of this issue,
see Perez and Berger (2002). To illustrate, we compare the GLM and nonlinear
compartmental models. The normalizing constants for these models are 0.00077
and 0.00032, respectively, as estimated via importance sampling with the prior as
proposal and using (3.28). Consequently, the Bayes factor comparing the GLM to
the nonlinear model is 2.4 so that the data are just over twice as likely under the
GLM, but this is not strong evidence. For these data, based on the above analyses, we
conclude that both the GLM and the nonlinear models provide adequate fits to the
data, and there is little difference between the frequentist and Bayesian approaches
to inference.
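
When the prior is used as the importance sampling proposal, the estimate of a normalizing constant reduces to the average of the likelihood over draws from the prior. The sketch below illustrates this on a toy conjugate model, $y \mid \theta \sim N(\theta, 1)$ with $\theta \sim N(0, 1)$, which is not one of the models in the text but has a known marginal density, $N(0, 2)$, against which the estimate can be checked. The function names, seed, and number of draws are our own choices. The last lines reproduce the Bayes factor from the two normalizing constants reported above.

```python
import math
import random

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def marginal_likelihood(y_obs, n_draws, rng):
    """Average the likelihood over draws from the prior theta ~ N(0, 1)."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 1.0)            # draw from the prior
        total += normal_pdf(y_obs, theta, 1.0)  # likelihood at that draw
    return total / n_draws

y_obs = 1.0
estimate = marginal_likelihood(y_obs, 20000, random.Random(0))
exact = normal_pdf(y_obs, 0.0, 2.0)   # marginal density of y in the toy model
print("estimate:", round(estimate, 4), "exact:", round(exact, 4))

# Bayes factor from the normalizing constants reported in the text.
bayes_factor = 0.00077 / 0.00032
print("Bayes factor, GLM versus nonlinear model:", round(bayes_factor, 1))
```

This estimator works well when the prior is not much more diffuse than the likelihood; with a vague prior most draws contribute almost nothing and the estimate becomes unstable.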

We  now  demonstrate  the  benefits  of  a  Bayesian  approach  with  substantive  prior 
information,  when  the  data  are  sparse.  To  this  end,  we  consider  a  reduced  dataset 
consisting  of  the  first  n  =  3  concentrations  only.  Clearly,  a  likelihood  or  least 
squares  approach  is  not  possible  in  this  case,  since  the  number  of  parameters 
(three  regression  parameters  plus  a  variance)  is  greater  than  the  number  of  data 
points.  We  fit  the  nonlinear  model  with  the  same  priors  as  used  previously  and 
with  computation  carried  out  with  the  rejection  algorithm.  Figure  6.14  shows  the 
posterior  distributions,  with  the  priors  also  indicated.  As  we  might  expect,  there  is 
no/little information in the data concerning the terminal half-life $\log 2/k_e$ or the
standard deviation $\sigma$. In contrast, the data are somewhat informative with respect to
the time to maximum concentration, and the maximum concentration.

6.17 Assessment of Assumptions for Nonlinear Models

In contrast to GLMs, residuals are unambiguously defined for nonlinear models as

$$e_i^\star = \frac{y_i - \hat\mu_i}{\sqrt{\widehat{\mathrm{var}}(Y_i)}}, \tag{6.50}$$

which we refer to as Pearson residuals. These residuals may be used in the usual
ways; see Sects. 5.11.3 and 6.9. In particular, the residuals may be plotted versus
covariates to assess the mean model, and the absolute values of the residuals may
be plotted versus the fitted values $\hat\mu_i$ to assess the appropriateness of the
mean-variance model. For a small sample size, normality of the errors will aid in accurate
asymptotic inference and may be assessed via a normal QQ plot, as described in
Sect. 5.11.3.
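
A minimal sketch of the Pearson residual calculation follows, assuming the variance model $\mathrm{var}(Y_i) = \sigma^2 \mu_i^r$ used earlier in the chapter. The function name `pearson_residuals` and the numerical inputs are hypothetical; in practice the fitted means and the variance estimate would come from the fitted model.

```python
import math

def pearson_residuals(y, mu_hat, sigma2_hat, r=0.0):
    """Residuals (y_i - mu_i) / sqrt(var(Y_i)) with var(Y_i) = sigma^2 * mu_i^r."""
    return [(yi - mi) / math.sqrt(sigma2_hat * mi ** r)
            for yi, mi in zip(y, mu_hat)]

# Hypothetical observations and fitted means.
y = [3.0, 3.0]
mu_hat = [2.0, 4.0]

constant_var = pearson_residuals(y, mu_hat, sigma2_hat=0.25, r=0.0)
constant_cv = pearson_residuals(y, mu_hat, sigma2_hat=0.25, r=2.0)
print("r = 0:", constant_var)
print("r = 2:", constant_cv)

# The absolute residuals are what would be plotted against the fitted values
# to examine the mean-variance relationship.
abs_resid = [abs(e) for e in constant_cv]
```

Setting `r = 2` corresponds to a constant coefficient of variation, so the same raw discrepancy counts for less when the fitted mean is larger.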


Example:  Pharmacokinetics  of  Theophylline 

Letting $y_i$ represent the log concentration at time $x_i$, we examine the Pearson
residuals, as given by (6.50), obtained following likelihood estimation with the
model $y_i \mid \beta, \sigma^2 \sim_{ind} N(\mu_i, \sigma^2)$, with $\mu_i$ given by (6.46), for $i = 1, \dots, n$.

Figure 6.6(c) plots $e_i^\star$ versus $x_i$ and shows no gross inadequacy of the mean model.
Panel (d), which plots $|e_i^\star|$ versus $x_i$, similarly shows no great problem with the
mean-variance relationship. Figure 6.15 gives a normal QQ plot of the residuals and
indicates no strong violation of normality. In all cases, interpretation is hampered by
the small sample size.


Fig. 6.15 Normal QQ plot for the theophylline data and model (6.46)

6.18 Concluding Remarks

Within  the  broad  class  of  general  regression  models,  the  use  of  GLMs  offers  certain 
advantages  in  terms  of  computation  and  interpretation,  though  one  should  not 
restrict  attention  to  this  class.  Many  results  and  approaches  used  for  linear  models 
hold  approximately  for  GLMs.  For  example,  the  influence  of  points  was  defined 
through  the  weight  matrix  used  in  the  “working  response”  approach  implicit  in  the 
IRLS  algorithm  (Sect.  6.5.2).  The  form  of  GLMs,  in  particular  the  linearity  of  the 
score  with  respect  to  the  responses,  is  such  that  asymptotic  inference  is  accurate  for 
relatively  small  n. 

Care  is  required  in  the  fitting  of,  and  inference  for,  nonlinear  models.  For 
example,  models  must  be  examined  to  see  if  the  parameters  are  uniquely  identified. 
For  both  GLMs  and  nonlinear  models,  the  examination  of  residual  plots  is  essential 
to  determine  whether  the  assumed  model  is  appropriate,  but  such  plots  are  difficult 
to  interpret  because  the  behavior  of  residuals  is  not  always  obvious,  even  if  the 
fitted  model  is  correct.  The  use  of  a  distribution  from  the  exponential  family  is 
advantageous  in  that  results  on  consistency  of  estimators  follow  easily,  as  discussed 
in  Sect.  6.5.1.  The  identifiability  of  nonlinear  models  should  always  be  examined, 
and  one  should  be  wary  of  the  accuracy  of  asymptotic  inference  for  small  sample 
sizes.  The  parameterization  adopted  is  also  important,  as  discussed  in  Sect.  6.15. 


6.19  Bibliographic  Notes 

The  most  comprehensive  and  interesting  description  of  GLMs  remains  McCullagh 
and  Nelder  (1989).  An  excellent  review  is  also  given  by  Firth  (1993).  Sandwich 
estimation  for  GLMs  is  discussed  by  Kauermann  and  Carroll  (2001). 

Nonlinear  models  are  discussed  by  Bates  and  Watts  (1988)  and  Chap.  2  of 
Davidian  and  Giltinan  (1995),  with  an  emphasis  on  generalized  least  squares.  Book- 
length  treatments  on  nonlinear  models  are  provided  by  Gallant  (1987);  Seber  and 
Wild  (1989);  see  also  Carroll  and  Ruppert  (1988). 

Gibaldi  and  Perrier  (1982)  provide  a  comprehensive  account  of  pharmacokinetic 
models  and  principles  and  Godfrey  (1983)  an  account  of  compartmental  modeling 
in  general.  Wakefield  et  al.  (1999)  provide  a  review  of  pharmacokinetic  and 
pharmacodynamic  modeling  including  details  on  both  the  biological  and  statistical 
aspects  of  such  modeling.  The  model  given  by  (6.7)  and  (6.8)  was  suggested  by 
Wakefield  (2004)  and  was  developed  more  extensively  in  Salway  and  Wakefield 
(2008).

6.20 Exercises


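6.1 A random variable $Y$ is inverse Gaussian if its density is of the form

$$p(y \mid \delta, \lambda) = \left(\frac{\delta}{2\pi y^3}\right)^{1/2} \exp\left[-\frac{\delta (y - \lambda)^2}{2\lambda^2 y}\right],$$

for $y > 0$.

(a) Show that the inverse Gaussian distribution is a member of the exponential
family and identify $\theta$, $\alpha$, $b(\theta)$, $a(\alpha)$, and $c(y, \alpha)$.

(b) Give forms for $E[Y \mid \theta, \alpha]$ and $\mathrm{var}(Y \mid \theta, \alpha)$ and determine the canonical
link function.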

6.2  Table  6.4  reproduces  data,  from  Altham  (1991),  of  counts  of  T4  cells/mm3  in 
blood  samples  from  20  patients  in  remission  from  Hodgkin’s  disease  and  from 
20  additional  patients  in  remission  from  disseminated  malignancies.  A  question 
of  interest  here  is  whether  there  is  a  difference  in  the  distribution  of  cell  counts 
between  the  two  diseases.  A  quantitative  assessment  of  any  difference  is  also 
desirable. 

(a) Carry out an exploratory examination of these data and provide an informative
graphical summary of the two distributions of responses.

(b) These data may be examined: (1) on their original scale, (2) $\log_e$ transformed,
and (3) square root transformed. Carefully define a difference
in location parameter in each of the designated scales. What are the
considerations when choosing a scale? Obtain a 90% confidence interval for
each of the difference parameters.

(c)  Fit  Poisson,  gamma,  and  inverse  Gaussian  models  to  the  cell  count  data, 
assuming  canonical  links  in  each  case. 

(d)  Using  the  asymptotic  distribution  of  the  MLE,  give  90%  confidence 
intervals  for  the  difference  parameters  in  each  of  the  three  models.  Under 
each  of  the  models,  would  you  conclude  that  the  means  of  the  two  groups 
are  equal? 

Table 6.4 Counts of T4 cells/mm3 in blood samples from 20 patients in remission from Hodgkin’s
disease and 20 other patients in remission from disseminated malignancies

Hodgkin’s disease        396    568  1,212    171    554  1,104    257    435    295    397
Non-Hodgkin’s disease    375    375    752    208    151    116    736    192    315  1,252
Hodgkin’s disease        288  1,004    431    795  1,621  1,378    902    958  1,283  2,415
Non-Hodgkin’s disease    675    700    440    771    688    426    410    979    377    503

Table 6.5 Concentrations $y_i$ of the drug cadralazine as a function of time $x_i$, obtained
from a subject who was administered a dose of 30 mg. These data are from
Wakefield et al. (1994)

Observation number $i$    Time (hours) $x_i$    Concentration (mg/liter) $y_i$
1                         2                     1.63
2                         4                     1.01
3                         6                     0.73
4                         8                     0.55
5                         10                    0.41
6                         24                    0.01
7                         28                    0.06
8                         32                    0.02

6.3 The data in Table 6.5, taken from Wakefield et al. (1994), were collected
following the administration of a single 30 mg dose of the drug cadralazine
to a cardiac failure patient. The response $y_i$ represents the drug concentration
at time $x_i$, $i = 1, \dots, 8$. The most straightforward model for these data is to
assume

$$\log y_i = \mu_i(\beta) + \epsilon_i = \log\left[\frac{D}{V}\exp(-k_e x_i)\right] + \epsilon_i,$$

where $\epsilon_i \mid \sigma^2 \sim_{ind} N(0, \sigma^2)$, $\beta = [V, k_e]$ and the dose is $D = 30$. The
parameters are the volume of distribution $V > 0$ and the elimination rate $k_e$.


(a) For this model, obtain expressions for:

(i) The log-likelihood function $L(\beta, \sigma^2)$

(ii) The score function $S(\beta, \sigma^2)$

(iii) The expected information matrix $I(\beta, \sigma^2)$

(b) Obtain the MLE and provide an asymptotic 95% confidence interval for
each element of $\beta$.

(c) Plot the data, along with the fitted curve.

(d) Using residuals, examine the appropriateness of the assumptions of the
above model. Does the model seem reasonable for these data?

(e) The clearance $Cl = V \times k_e$ and elimination half-life $x_{1/2} = \log 2/k_e$
are parameters of interest in this experiment. Find the MLEs of these
parameters along with asymptotic 95% confidence intervals.


A Bayesian analysis will now be carried out, assuming independent
lognormal priors for $V$, $k_e$ and an independent inverse gamma prior for
$\sigma^2$. For the latter, assume the improper prior $\pi(\sigma^2) \propto \sigma^{-2}$.

(f) Assume that the 50% and 90% points for $V$ are 20 and 40 and that for $k_e$,
these points are 0.12 and 0.25. Solve for the lognormal parameters using
the method of moments equations (6.36).

(g) Implement an MCMC Metropolis-Hastings algorithm (Sect. 3.8.2). Report
the median and 90% interval estimates for each of $V$, $k_e$, $Cl$, and $x_{1/2}$. Provide
graphical summaries of each of the univariate and bivariate posterior
distributions.

6.4 Let $Y_i$ represent a count and $x_i = [x_{i1}, \dots, x_{ik}]$ a covariate vector for
individual $i$, $i = 1, \dots, n$. Assume that $Y_i \mid \mu_i \sim_{ind} \mathrm{Poisson}(\mu_i)$, with

$$\mu_i = E[Y_i \mid \gamma_{0i}, \gamma_1, \dots, \gamma_k] = \exp(\gamma_{0i} + \gamma_1 x_{i1} + \dots + \gamma_k x_{ik}), \tag{6.51}$$

where the intercept is a random effect (see Chap. 9) that varies according to
$\gamma_{0i} \mid \gamma_0, \tau^2 \sim_{iid} N(\gamma_0, \tau^2)$.

(a) Give an interpretation of each of the parameters $\gamma_0$ and $\gamma_1$.

(b) Suppose we fit an alternative Poisson model with mean

$$\mu_i = E[Y_i \mid \beta_0, \beta_1, \dots, \beta_k] = \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}). \tag{6.52}$$

Evaluate

$$E[Y_i \mid \tau^2, \gamma_0, \gamma_1, \dots, \gamma_k],$$

and hence, by comparison with $E[Y_i \mid \beta_0, \beta_1, \dots, \beta_k]$, equate $\gamma_j$ to
$\beta_j$, $j = 0, 1, \dots, k$.

(c) Evaluate $\mathrm{var}(Y_i \mid \tau^2, \gamma_0, \gamma_1, \dots, \gamma_k)$ and compare this expression with
$\mathrm{var}(Y_i \mid \beta_0, \beta_1, \dots, \beta_k)$.

(d) Suppose one is interested in the parameters $\gamma_1, \dots, \gamma_k$. Use your answers
to the previous two parts to discuss the implications of fitting model (6.52)
when the true model is (6.51).

(e) Now consider an alternative random effects structure in which

$$\delta_i \mid a, b \sim_{iid} \mathrm{Ga}(a, b),$$

where $\delta_i = \exp(\gamma_{0i})$. Evaluate the marginal mean $E[Y_i \mid a, b, \gamma_1, \dots, \gamma_k]$
and marginal variance $\mathrm{var}(Y_i \mid a, b, \gamma_1, \dots, \gamma_k)$.

(f) Compare the expressions for the mean and variance under the normal and
gamma formulations.

(g) For the Poisson-Gamma model, calculate the form of the likelihood

$$L(\gamma_1, \dots, \gamma_k, a, b) = \prod_{i=1}^{n} \int \Pr(y_i \mid \gamma_{0i}, \gamma_1, \dots, \gamma_k)\,\pi(\gamma_{0i} \mid a, b)\,d\gamma_{0i}.$$

Derive expressions for the score and information matrix and hence describe
how inference may be performed from a likelihood standpoint.

Table 6.6 Concentrations $y_i$ of the drug theophylline as a function of time $x_i$ obtained
from a subject who was administered an oral dose of size 4.40 mg/kg

Observation number $i$    Time (hours) $x_i$    Concentration (mg/liter) $y_i$
1                         0.27                  1.72
2                         0.52                  7.91
3                         1.00                  8.31
4                         1.92                  8.33
5                         3.50                  6.85
6                         5.02                  6.08
7                         7.03                  5.40
8                         9.00                  4.55
9                         12.00                 3.01
10                        24.30                 0.90

6.5 Table 6.6 gives concentration-time data for an individual who was given a dose
of 4.40 mg/kg of the drug theophylline. In this chapter we have analyzed the
data from another of the individuals in the same trial.

(a) For the data in Table 6.6,¹ fit the gamma GLM given by (6.7) and (6.8)
using maximum likelihood and report the MLEs and standard errors.

(b) Obtain MLEs and standard errors for the parameters of interest $x_{1/2}$, $x_{\max}$,
$\mu(x_{\max})$, and $Cl$.

(c) Let $z_i$ represent the log concentration and consider the model
$z_i \mid \beta, \sigma^2 \sim_{ind} N[\mu_i(\beta), \sigma^2]$, $i = 1, \dots, n$, where $\mu_i(\beta)$ is given by the
compartmental model (6.46). Fit this model using maximum likelihood and report the
MLEs and standard errors.

(d) Obtain the MLEs and standard errors for the parameters of interest $x_{1/2}$,
$x_{\max}$, $\mu(x_{\max})$, and $Cl$.

(e) Compare these summaries with those obtained under the GLM.

(f) Examine the fit of the two models and discuss which provides the better fit.


¹ These data correspond to individual 2 in the Theoph data, which are available in R.


Chapter  7 

Binary  Data  Models 


7.1  Introduction 

In  this  chapter  we  consider  the  modeling  of  binary  data.  Such  data  are  ubiquitous 
in  many  fields.  Binary  data  present  a  number  of  distinct  challenges,  and  so  we 
devote  a  separate  chapter  to  their  modeling,  though  we  lean  heavily  on  the  methods 
introduced  in  Chap.  6  on  general  regression  modeling.  It  is  perhaps  surprising  that 
the  simplest  form  of  outcome  can  pose  difficulties  in  analysis,  but  a  major  problem 
is  the  lack  of  information  contained  within  a  variable  that  can  take  one  of  only 
two  values.  This  can  lead  to  a  number  of  problems,  for  example,  in  assessing 
model  fit.  Another  major  complication  arises  because  models  for  probabilities  are 
generally  nonlinear,  which  can  lead  to  curious  behavior  of  estimators  in  the  presence 
of  confounders.  Difficulties  in  interpretation  also  arise,  even  when  independent 
regressors  are  added  to  the  model. 

The  outline  of  this  chapter  is  as  follows.  We  give  some  motivating  examples 
in  Sect.  7.2,  and  in  Sect.  7.3,  describe  the  genesis  of  the  binomial  model,  which 
is  a  natural  candidate  for  the  analysis  of  binary  data.  Generalized  linear  models  for 
binary  data  are  examined  in  Sect.  7.4.  The  binomial  model  has  a  variance  determined 
by  the  mean,  with  no  additional  parameter  to  accommodate  excess-binomial 
variation,  and  so  Sect.  7.5  describes  methods  for  dealing  with  such  variation.  For 
reasons  that  will  become  apparent,  we  will  focus  on  logistic  regression  models, 
beginning  with  a  detailed  description  in  Sect.  7.6.  This  section  includes  discussions 
of estimation from likelihood, quasi-likelihood, and Bayesian perspectives. Conditional
likelihood and “exact” inference are the subject of Sect. 7.7. Assessing the
adequacy of binary models is discussed in Sect. 7.8. Summary measures that exhibit
nonobvious behavior are the subject of Sect. 7.9. Case-control studies are a common
design, which offers interesting inferential challenges, and
are described in Sect. 7.10. Concluding comments appear in Sect. 7.11. Section 7.12
gives references to more in-depth treatments of binary modeling and to source
materials.

7.2 Motivating Examples
7.2.1  Outcome  After  Head  Injury 

We  will  illustrate  methods  for  binary  data  using  the  data  first  encountered  in 
Sect.  1.3.2.  The  binary  response  is  outcome  after  head  injury  (dead/alive),  with 
four  discrete  covariates:  pupils  (good/poor),  coma  score  (depth  of  coma,  low/high), 
hematoma  present  (no/yes),  and  age  (categorized  as  1-25,  26-54,  >55).  These  data 
were  presented  in  Table  1.1,  but  it  is  difficult  to  discern  patterns  from  this  table.  In 
general,  cross-classified  data  such  as  these  may  be  explored  by  looking  at  marginal 
and  conditional  tables  of  counts  or  frequencies.  Figure  7.1  displays  conditional 
frequencies,  with  panel  (a)  corresponding  to  low  coma  score  and  panel  (b)  to  high 
coma  score.  These  plots  suggest  that  the  probability  of  death  increases  with  age, 
that  a  low  coma  score  is  preferable  to  a  high  coma  score,  and  that  good  pupils  are 
beneficial.  The  association  with  the  hematoma  variable  is  less  clear.  The  sample 
sizes  are  lost  in  these  plots,  which  makes  interpretation  more  difficult. 


7.2.2  Aircraft  Fasteners 

Montgomery  and  Peck  (1982)  describe  a  study  in  which  the  compressive  strength 
of  fasteners  used  in  the  construction  of  aircraft  was  examined.  Table  7.1  gives  the 
total  number  of  fasteners  tested  and  the  number  of  failures  at  a  range  of  pressure 
loads.  We  see  that  the  proportion  failing  increases  with  load.  For  these  data  we  will 
aim  to  find  a  curve  to  adequately  model  the  relationship  between  the  probability  of 
fastener  failure  and  load  pressure. 


Fig. 7.1 Probability of death after head injury as a function of age, hematoma score, and pupils:
Panels (a) and (b) are for low and high coma scores, respectively

Table 7.1 Number of aircraft fastener failures at specified pressure loads

Load (psi)    Failures    Sample size    Proportion failing
2,500         10          50             0.20
2,700         17          70             0.24
2,900         30          100            0.30
3,100         21          60             0.35
3,300         18          40             0.45
3,500         43          85             0.51
3,700         54          90             0.60
3,900         33          50             0.66
4,100         60          80             0.75
4,300         51          65             0.78
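
The final column of Table 7.1 is simply the number of failures divided by the sample size at each load. The short sketch below, in plain Python with variable names of our own choosing, recomputes these proportions from the tabulated counts and confirms that they increase with load.

```python
loads = [2500, 2700, 2900, 3100, 3300, 3500, 3700, 3900, 4100, 4300]
failures = [10, 17, 30, 21, 18, 43, 54, 33, 60, 51]
sizes = [50, 70, 100, 60, 40, 85, 90, 50, 80, 65]

# Observed proportion failing at each load.
proportions = [f / n for f, n in zip(failures, sizes)]

for load, p in zip(loads, proportions):
    print(load, round(p, 2))

# Check that the observed proportions rise steadily with load.
is_increasing = all(p < q for p, q in zip(proportions, proportions[1:]))
print("increasing with load:", is_increasing)
```

Because the sample sizes differ across loads, the proportions are not equally precise, which is one reason a model that weights by sample size is preferable to fitting a curve to the proportions directly.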

7.2.3  Bronchopulmonary  Dysplasia 

We describe data from van Marter et al. (1990), subsequently analyzed by
Pagano and Gauvreau (1993), on the absence/presence of bronchopulmonary dysplasia
(BPD) as a function of birth weight (in grams) for n = 223 babies. BPD
is a chronic lung disease that affects premature babies. In this study, BPD was
defined as a function of both oxygen requirement and compatible chest radiograph,
with 147 of the babies having neither characteristic by day 28 of life. We take as
an illustrative aim the prediction of BPD using birth weight, the rationale being that if
a good predictive model can be found, then measures could be taken to decrease the
probability  of  BPD.  There  are  a  number  of  caveats  that  should  be  attached  to  this 
analysis.  First,  these  data  are  far  from  a  random  sample  of  births,  as  they  are  sampled 
from  intubated  infants  with  weights  less  than  1,751  g  (so  that  all  of  the  babies  are 
of  low  birth  weight).  In  general,  an  estimate  of  the  incidence  of  BPD  is  difficult  to 
tie  down,  in  part,  because  of  changes  in  the  definition  of  the  condition.  Allen  et  al. 
(2003)  provide  a  discussion  of  this  issue  and  report  that,  of  preterm  infants  with 
birth  weights  less  than  1,000  g,  30%  develop  BPD.  Second,  a  number  of  additional 
covariates  would  be  available  in  a  serious  attempt  at  prediction,  including  gender 
and  the  medication  used  by  the  mothers. 

Figure 7.2 displays the BPD indicator, plotted as short vertical lines at 0 and 1, as
a function of birth weight. Visual assessment suggests that children with lower birth
weight tend to have an increased chance of BPD. It is hard to discern the shape of the
association from the raw binary data alone, however, since one must compare
the distributions of zeros and ones by eye. This example is distinct from
the aircraft fasteners because the latter contained multiple responses at each x value.
Binning on the basis of birth weight and plotting the proportions with BPD in each
bin would provide a more informative plot.

Fig. 7.2 Indicator of bronchopulmonary dysplasia (BPD), as a function of birth weight. The short vertical lines at 0 and 1 indicate the observed birth weights for non-BPD and BPD infants, respectively. The dashed curve corresponds to a logistic regression fit and the dotted curve to a complementary log-log regression fit


7.3  The  Binomial  Distribution 


7.3.1  Genesis 

In the following we will refer to the basic sampling unit as an individual. Let $Z$
denote the Bernoulli random variable with

\[ \Pr(Z = z \mid p) = p^{z}(1-p)^{1-z}, \]

$z = 0, 1$, and

\[ p = \Pr(Z = 1 \mid p), \]

for $0 < p < 1$. For concreteness, we will call the $Z = 1$ outcome a positive
response. A random variable taking two values must have a Bernoulli distribution,
and all moments are determined as functions of $p$. In particular, $\text{var}(Z \mid p) =
p(1-p)$ so that there is no concept of underdispersion or overdispersion for a
Bernoulli random variable.

Suppose there are $N$ individuals, and let $Z_j$ denote the outcome for the $j$th
individual, $j = 1, \dots, N$. Also let $Y = \sum_{j=1}^{N} Z_j$ be the total number of individuals
with a positive outcome, and suppose that each has equal probabilities, that is, $p =
p_1 = \dots = p_N$. Under the assumption that the Bernoulli random variables are
independent,

\[ Y \mid p \sim \text{Binomial}(N, p) \]

so that

\[ \Pr(Y = y \mid p) = \binom{N}{y} p^{y}(1-p)^{N-y}, \qquad (7.1) \]

for $y = 0, 1, \dots, N$.

Constant $p = p_j$, $j = 1, \dots, N$, over the $N$ individuals is not necessary for $Y$ to
follow a binomial distribution. Suppose that individual $j$ has probability $p_j$ drawn
at random from a distribution with mean $p$. In this case,

\[ E[Z_j] = E[E(Z_j \mid p_j)] = p \]

and

\[ Y \mid p \sim \text{Binomial}(N, p). \qquad (7.2) \]

Crucial to this derivation is the assumption that the $p_j$ are independent draws from
the distribution with mean $p$, which means that the $Z_j$ are also independent for
$j = 1, \dots, N$. Alternative scenarios are described in the context of overdispersion
in Sect. 7.5.

We give a second derivation of the binomial distribution. Suppose $Y_j \mid \lambda_j \sim
\text{Poisson}(\lambda_j)$, $j = 1, 2$, are independent Poisson random variables with rates $\lambda_j$.
Then,

\[ Y_1 \mid Y_1 + Y_2, p \sim \text{Binomial}(Y_1 + Y_2, p), \]

with $p = \lambda_1/(\lambda_1 + \lambda_2)$ (Exercise 7.3).
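
The conditional result can be checked numerically. The following is a minimal sketch, not a proof: it assumes illustrative rates (3 and 5) and an illustrative total of 6, uses the standard fact that a sum of independent Poisson counts is Poisson, and compares the conditional probabilities with the binomial probabilities. The function names are hypothetical.

```python
from math import comb, exp, factorial


def poisson_pmf(y, lam):
    """Pr(Y = y) for Y ~ Poisson(lam)."""
    return exp(-lam) * lam**y / factorial(y)


def binomial_pmf(y, n, p):
    """Pr(Y = y) for Y ~ Binomial(n, p)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


def conditional_pmf(y1, total, lam1, lam2):
    """Pr(Y1 = y1 | Y1 + Y2 = total) for independent Poisson counts."""
    joint = poisson_pmf(y1, lam1) * poisson_pmf(total - y1, lam2)
    # The sum Y1 + Y2 is Poisson with rate lam1 + lam2.
    return joint / poisson_pmf(total, lam1 + lam2)


lam1, lam2, total = 3.0, 5.0, 6  # illustrative values only
p = lam1 / (lam1 + lam2)
max_gap = max(
    abs(conditional_pmf(y, total, lam1, lam2) - binomial_pmf(y, total, p))
    for y in range(total + 1)
)
print(max_gap)  # agreement up to floating-point rounding
```

Any other positive rates and nonnegative total could be substituted; the agreement does not depend on the particular values chosen here.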


7.3.2  Rare  Events 

Suppose that $Y \mid p \sim \text{Binomial}(N, p)$ and that $p \to 0$ and $N \to \infty$, with $\lambda =
Np$ fixed (or tending to a constant). Then Exercise 7.1 shows that, in the limit,
$Y \mid \lambda \sim \text{Poisson}(\lambda)$. Approximating the binomial distribution with a Poisson has
a number of advantages. Computationally, the Poisson model can be more stable
than the binomial model. Also, $\lambda > 0$ can be modeled via a loglinear form which
provides a more straightforward interpretation than the logistic form, $\log[p/(1-p)]$.
The following example illustrates one use of this result for obtaining a closed-form
distribution when counts are summed.
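
As a rough numerical illustration of the limit, the sketch below holds $\lambda = Np$ fixed at an arbitrary value of 2 and compares the binomial and Poisson probabilities as $N$ grows. The choice of $\lambda$, the grid of $N$ values, and the truncation of the comparison at 20 counts are assumptions made purely for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p) ** (n - y)


def poisson_pmf(y, lam):
    return exp(-lam) * lam**y / factorial(y)


def max_abs_diff(n, lam, y_max=20):
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam)
    over the counts 0, ..., min(n, y_max)."""
    p = lam / n
    return max(
        abs(binomial_pmf(y, n, p) - poisson_pmf(y, lam))
        for y in range(min(n, y_max) + 1)
    )


lam = 2.0  # illustrative value of N * p
diffs = [max_abs_diff(n, lam) for n in (10, 100, 1000, 10000)]
for n, d in zip((10, 100, 1000, 10000), diffs):
    print(n, d)  # the gap shrinks as N increases with N * p held fixed
```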


Example:  Lung  Cancer  and  Radon 

In Sect. 1.3.3 we introduced the lung cancer dataset, with $Y_i$ being the number of
cases in area $i$. A possible model for these data is

\[ Y_i \mid \theta_i \sim \text{Poisson}(E_i \theta_i), \qquad (7.3) \]

where $E_i$ is the expected number of cases based on the age and gender breakdown
of area $i$ and $\theta_i$ is the relative risk associated with the area, for $i = 1, \dots, n$.

A formal derivation of this model is as follows (see Sect. 6.5 for a related
discussion). Let $Y_{ij}$ be the disease counts in area $i$ and age-gender stratum $j$ and
$N_{ij}$ the associated population, $i = 1, \dots, n$, $j = 1, \dots, J$. In the Minnesota study,
we have $J = 36$, corresponding to male/female and 18 age bands: 0-4, 5-9, ..., 80-84,
85+. We only have access to the total counts in the area, $Y_i$, and so we require a
model for this sum. One potential model is $Y_{ij} \mid p_{ij} \sim \text{Binomial}(N_{ij}, p_{ij})$, with $p_{ij}$
the probability of lung cancer diagnosis in area $i$, stratum $j$. With binomial $Y_{ij}$, the
distribution of $Y_i = \sum_{j=1}^{J} Y_{ij}$ is a convolution, which is unfortunately awkward to
work with. For example, for $J = 2$,

\[ \Pr(y_i \mid p_{i1}, p_{i2}) = \sum_{y_{i1} = l_i}^{u_i} \binom{N_{i1}}{y_{i1}} \binom{N_{i2}}{y_i - y_{i1}} p_{i1}^{y_{i1}} (1 - p_{i1})^{N_{i1} - y_{i1}} \, p_{i2}^{y_i - y_{i1}} (1 - p_{i2})^{N_{i2} - y_i + y_{i1}}, \]

where $l_i = \max(0, y_i - N_{i2})$, $u_i = \min(N_{i1}, y_i)$, gives the range of admissible
values that $y_{i1}$ can take, given the margins $y_i$, $N_i - y_{i1} - y_{i2}$, $N_{i1}$, $N_{i2}$. Lung cancer
is statistically rare, and so we can use the Poisson approximation to give $Y_{ij} \mid p_{ij} \sim
\text{Poisson}(N_{ij} p_{ij})$. The distribution of the sum $Y_i$ is then straightforward:

\[ Y_i \mid p_{i1}, \dots, p_{iJ} \sim \text{Poisson}\left( \sum_{j=1}^{J} N_{ij} p_{ij} \right). \qquad (7.4) \]


There are insufficient data to estimate the $n \times J$ probabilities $p_{ij}$, and so it is
common to assume $p_{ij} = \theta_i \times q_j$, where $q_j$ are a set of known reference stratum-specific
rates and $\theta_i$ is an area-specific term that summarizes the deviation of the
risks in area $i$ from the reference rates. Therefore, this model assumes that the effect
on risk of being in area $i$ is the same across strata. Usually, the $q_j$ are assumed
known. Consequently, (7.4) simplifies to $Y_i \mid \theta_i \sim \text{Poisson}\left(\theta_i \sum_{j=1}^{J} N_{ij} q_j\right)$, and
substituting the expected numbers $E_i = \sum_{j=1}^{J} N_{ij} q_j$ produces model (7.3).
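
To make the calculation of the expected numbers concrete, here is a small sketch with entirely hypothetical inputs: two areas, three strata (rather than the $J = 36$ of the Minnesota study), and invented populations, reference rates, and observed counts. It computes $E_i = \sum_j N_{ij} q_j$ and the ratio $Y_i / E_i$, which is the maximum likelihood estimate of $\theta_i$ under model (7.3).

```python
# Hypothetical inputs for illustration only (not the Minnesota data).
populations = [            # N_ij: rows are areas, columns are strata
    [1000, 2000, 500],
    [4000, 1000, 1500],
]
reference_rates = [0.0001, 0.0005, 0.002]  # q_j, assumed known
observed = [3, 4]                          # Y_i, total cases per area


def expected_counts(populations, reference_rates):
    """E_i = sum_j N_ij * q_j for each area i."""
    return [
        sum(n_ij * q_j for n_ij, q_j in zip(row, reference_rates))
        for row in populations
    ]


expected = expected_counts(populations, reference_rates)
# Under Y_i ~ Poisson(E_i * theta_i), the MLE of theta_i is Y_i / E_i.
relative_risk_mle = [y / e for y, e in zip(observed, expected)]
print(expected)
print(relative_risk_mle)
```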


7.4  Generalized  Linear  Models  for  Binary  Data 
7.4.1  Formulation 

Let $Z_{ij} = 0/1$ denote the absence/presence of the binary characteristic of interest
in each of the $j = 1, \dots, N_i$ trials, with $i = 1, \dots, n$ different “conditions.” Let
$Y_i = \sum_{j=1}^{N_i} Z_{ij}$ denote the number of positive responses and $N = \sum_{i=1}^{n} N_i$ the
total number of trials. Further, suppose there are $k$ explanatory variables recorded
for each condition, and let $x_i = [1, x_{i1}, \dots, x_{ik}]$ denote the row vector of dimension
$1 \times (k + 1)$ for $i = 1, \dots, n$. We now wish to model the probability of a positive
response $p(x_i)$, as a function of $x_i$, in order to identify structure within the data.

We might naively model the observed proportion via the linear model

\[ \frac{Y_i}{N_i} = x_i \beta + \epsilon_i, \]

for $i = 1, \dots, n$. There are a number of difficulties with such an approach. First,
the observed proportions must lie in the range $[0, 1]$, while the modeled probability
$x_i \beta$ is unrestricted. We could attempt to put constraints on the parameters in order
to alleviate this drawback, but this is inelegant and soon becomes cumbersome with
multiple explanatory variables. The resultant inference is also difficult due to the
restricted ranges. The second difficulty is that we saw in Sect. 5.6.4 that in the usual
linear model framework, an appropriate mean-variance model is crucial for well-calibrated
inference (unless sandwich estimation is turned to). A linear model is
usually associated with error terms with constant variance, but this is not appropriate
here since

\[ \text{var}\left( \frac{Y_i}{N_i} \right) = \frac{p(x_i)[1 - p(x_i)]}{N_i} \]

so that the variance changes with the mean. The generalized linear model, introduced
and discussed in Sect. 6.3, can rectify these deficiencies. For sums of binary
variables, the binomial model is a good starting point.

The binomial model is a member of the exponential family, specifically $Y \mid p \sim
\text{Binomial}(N, p)$, that is, (7.1), translates to

\[ p(y \mid p) = \binom{N}{y} \exp\left[ y \log\left( \frac{p}{1-p} \right) + N \log(1 - p) \right], \qquad (7.5) \]

which provides the stochastic element of the model. For the deterministic part, we
specify a monotonic, differentiable link function:

\[ g[p(x)] = x\beta. \qquad (7.6) \]


The exponential family is appealing from a statistical standpoint because correct
specification of the mean function leads to consistent inference, since the score
function is linear in the data (this function is given for the logistic model in (7.12)).
With a GLM, the computation is also usually straightforward (Sect. 6.5.2). Nonlinear
models can also be considered, however, if warranted by the application.
For example, Diggle and Rowlingson (1994) considered modeling disease risk as
a function of distance $x$ from a point source of pollution. These authors desired a
model for which disease risk returned to baseline as $x \to \infty$ and suggested a model
for the odds of the form

\[ \frac{\Pr(Z = 1 \mid x)}{\Pr(Z = 0 \mid x)} = \beta_0 \left[ 1 + \beta_1 \exp(-\beta_2 x^2) \right], \]

with $\beta_0$ corresponding to baseline odds, $\beta_1$ corresponding to the excess odds at
$x = 0$ (i.e., at the point source), and $\beta_2$ determining the speed at which the odds
decline to baseline. Such nonlinear models are computationally more difficult to fit
but produce consistent parameter estimates, if combined with an exponential family.


7.4.2  Link  Functions 

From (7.5) we see that the so-called canonical link is the logit $\theta = \log[p/(1-p)]$.
We will see that logistic regression models of the form

\[ \log\left[ \frac{p(x)}{1 - p(x)} \right] = x\beta \qquad (7.7) \]

offer a number of advantages in terms of computation and inference. This link
function is by far the most popular in practice, and so Sect. 7.6 is dedicated to
logistic regression modeling.

Other link functions that may be used for binomial data include the probit,
complementary log-log, and log-log links. The probit link is

\[ \Phi^{-1}[p(x)] = x\beta, \]

where $\Phi[\cdot]$ is the distribution function of a standard normal random variable. This
link function generally produces similar inference to the logistic link function. The
logistic and probit link functions are symmetric in the sense that $g(p) = -g(1 - p)$.
The complementary log-log link function is

\[ \log\{-\log[1 - p(x)]\} = x\beta, \qquad (7.8) \]

to give

\[ p(x) = 1 - \exp[-\exp(x\beta)], \]

which is not symmetric. Hence, the log-log link model

\[ -\log\{-\log[p(x)]\} = x\beta \]

with

\[ p(x) = \exp[-\exp(-x\beta)] \]

may also be used and will not produce the same inference as (7.8). If $g_{CLL}(\cdot)$ and
$g_{LL}(\cdot)$ represent the complementary log-log and log-log links, respectively, then the
two are related via $g_{CLL}(p) = -g_{LL}(1 - p)$.
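
The symmetry properties of the links are easy to verify numerically. The sketch below is illustrative only: the function names are hypothetical, the test probability 0.3 is arbitrary, and the log-log link is coded with the sign convention under which $p(x) = \exp[-\exp(-x\beta)]$.

```python
from math import exp, log
from statistics import NormalDist

_standard_normal = NormalDist()  # mean 0, standard deviation 1


def logit(p):
    return log(p / (1 - p))


def probit(p):
    return _standard_normal.inv_cdf(p)


def cloglog(p):
    """Complementary log-log link, as in (7.8)."""
    return log(-log(1 - p))


def loglog(p):
    """Log-log link, with inverse p = exp(-exp(-eta))."""
    return -log(-log(p))


def inv_cloglog(eta):
    return 1 - exp(-exp(eta))


def inv_loglog(eta):
    return exp(-exp(-eta))


p = 0.3  # arbitrary probability used for the checks
print(logit(p) + logit(1 - p))      # approximately 0: symmetric
print(probit(p) + probit(1 - p))    # approximately 0: symmetric
print(cloglog(p) + cloglog(1 - p))  # clearly nonzero: not symmetric
print(cloglog(p) + loglog(1 - p))   # approximately 0: g_CLL(p) = -g_LL(1 - p)
```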


7.5  Overdispersion 

Overdispersion is a phenomenon that occurs frequently in applications and, in the
binomial data context, describes a situation in which the variance $\text{var}(Y_i \mid p_i)$
exceeds the binomial variance $N_i p_i(1 - p_i)$.

Often overdispersion occurs due to clustering in the population from which the
individuals were drawn. To motivate a variance model, suppose for simplicity that
the $N_i$ individuals for whom we measure outcomes in trial $i$ are actually broken
into $C_i$ clusters of size $k_i$ so that $N_i = C_i \times k_i$. These clusters may correspond
to families, geographical areas, genetic subgroups, etc. Within the $c$th cluster, the
number of positive responders $Y_{ic}$ has distribution $Y_{ic} \mid p_{ic} \sim_{ind} \text{Binomial}(k_i, p_{ic})$,
where each $p_{ic}$ is drawn independently from some distribution, for $c = 1, \dots, C_i$.
Let $P_{ic}$ represent a random variable with

\[ E[P_{ic}] = p_i, \qquad \text{var}(P_{ic}) = \tau_i^2 p_i(1 - p_i), \]

where the variance is written in this form for convenience (as we see shortly). In the
following we will use expressions for iterated expectation, variance, and covariance,
as described in Appendix B. Then, letting $Y_i = \sum_{c=1}^{C_i} Y_{ic}$,

\[ E[Y_i] = E\left[ \sum_{c=1}^{C_i} Y_{ic} \right] = \sum_{c=1}^{C_i} E_{P_{ic}}\left[ E(Y_{ic} \mid P_{ic}) \right] = \sum_{c=1}^{C_i} E_{P_{ic}}[k_i P_{ic}] = N_i p_i. \]

Turning to the variance,

\[ \text{var}(Y_i) = \text{var}\left( \sum_{c=1}^{C_i} Y_{ic} \right) = \sum_{c=1}^{C_i} \text{var}(Y_{ic}), \]

since the counts are independent, as each $p_{ic}$ is drawn independently. Continuing
with this calculation and exploiting the iterated variance formula,

\begin{align*}
\text{var}(Y_i) &= \sum_{c=1}^{C_i} \left\{ E[\text{var}(Y_{ic} \mid P_{ic})] + \text{var}(E[Y_{ic} \mid P_{ic}]) \right\} \\
&= \sum_{c=1}^{C_i} \left\{ E_{P_{ic}}[k_i P_{ic}(1 - P_{ic})] + \text{var}_{P_{ic}}(k_i P_{ic}) \right\} \\
&= \sum_{c=1}^{C_i} \left\{ k_i p_i - k_i \left[ \text{var}(P_{ic}) + E[P_{ic}]^2 \right] + k_i^2 \tau_i^2 p_i(1 - p_i) \right\} \\
&= N_i p_i(1 - p_i) \times \left[ 1 + (k_i - 1)\tau_i^2 \right] \\
&= N_i p_i(1 - p_i)\sigma_i^2.
\end{align*}

Hence, the within-trial clustering has induced excess-binomial variation. Suppose
each cluster is of size $k_i = 1$ (i.e., $C_i = N_i$); then we recover the binomial
case (7.2). The above derivation requires $1 \le \sigma_i^2 \le k_i \le N_i$, since $0 \le \tau_i^2 \le 1$
(McCullagh and Nelder 1989, Sect. 4.5.1). If we were to assume a second moment
model with a common $\sigma_i^2 = \sigma^2$ to give

\[ \text{var}(Y_i) = N_i p_i(1 - p_i)\sigma^2 \qquad (7.9) \]

then the constraint becomes $\sigma^2 \le N_i$, which is unfavorable, but will rarely be a
problem in practice.
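
The variance inflation factor $1 + (k_i - 1)\tau_i^2$ can be checked exactly in a toy case. The sketch below assumes, purely for illustration, that the cluster probability takes the two values $p_i - s$ and $p_i + s$ with equal probability (so that it has mean $p_i$ and variance $s^2$); the cluster size, number of clusters, and the values of $p_i$ and $s$ are arbitrary choices. The marginal variance is computed by direct enumeration of the mixture distribution and compared with the formula.

```python
from math import comb


def binomial_pmf(y, n, p):
    return comb(n, y) * p**y * (1 - p) ** (n - y)


def cluster_variance(k, p, s):
    """Exact variance of one cluster count Y_ic when P_ic equals
    p - s or p + s, each with probability 1/2 (an illustrative choice)."""
    pmf = [
        0.5 * binomial_pmf(y, k, p - s) + 0.5 * binomial_pmf(y, k, p + s)
        for y in range(k + 1)
    ]
    mean = sum(y * w for y, w in enumerate(pmf))
    return sum((y - mean) ** 2 * w for y, w in enumerate(pmf))


k, C, p, s = 5, 8, 0.3, 0.1   # cluster size, clusters, mean, half-spread
N = k * C
tau2 = s**2 / (p * (1 - p))   # so that var(P_ic) = tau2 * p * (1 - p)

exact = C * cluster_variance(k, p, s)   # clusters are independent
formula = N * p * (1 - p) * (1 + (k - 1) * tau2)
binomial = N * p * (1 - p)
print(exact, formula, binomial)  # exact and formula agree; both exceed binomial
```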

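If we have a single cluster, that is, $C_i = 1$, then $k_i = N_i$ and

\[ \text{var}(Y_i) = N_i p_i(1 - p_i) \times \left[ 1 + (N_i - 1)\tau_i^2 \right]. \qquad (7.10) \]

Suppose $Z_{ij}$, $j = 1, \dots, N_i$, are the binary outcomes within trial $i$ so that $Y_i =
\sum_{j=1}^{N_i} Z_{ij}$. Then, for the case of a single cluster ($C_i = 1$) and $j \ne k$,

\begin{align*}
\text{cov}(Z_{ij}, Z_{ik}) &= E[\text{cov}(Z_{ij}, Z_{ik} \mid P_{i1})] + \text{cov}(E[Z_{ij} \mid P_{i1}], E[Z_{ik} \mid P_{i1}]) \\
&= \text{cov}(P_{i1}, P_{i1}) \\
&= \text{var}(P_{i1}) = \tau_i^2 p_i(1 - p_i),
\end{align*}

so that, since $\text{var}(Z_{ij}) = p_i(1 - p_i)$, $\tau_i^2$ is the correlation between any two outcomes in trial $i$.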

We now discuss a closely related scenario in which we start by assuming that
outcomes within a trial have correlation $\tau_i^2$. Then (Exercise 7.4),

\[ \text{var}(Y_i) = N_i p_i(1 - p_i) \times \left[ 1 + (N_i - 1)\tau_i^2 \right]. \qquad (7.11) \]

Notice that, unlike the derivation leading to (7.10), underdispersion can occur
if $\tau_i^2 < 0$. The equality of (7.10) and (7.11) shows that the effect of either a
random response probability or positively correlated outcomes within a trial is
indistinguishable marginally (unless one is willing to make assumptions about the
within-trial distribution, but such assumptions are uncheckable).

Inferentially, two approaches are suggested. We could specify the first two moments
only and use quasi-likelihood. This route is taken in Sect. 7.6.3. Alternatively,
one can assume a specific distributional form and then proceed with parametric
inference, as we now illustrate.

The most straightforward way to model overdispersion parametrically is to
assume the binomial probability arises from a conjugate beta model. This model is

\[ Y_i \mid q_i \sim \text{Binomial}(N_i, q_i), \qquad q_i \sim \text{Beta}(a_i, b_i), \]

where we can parameterize as $a_i = d p_i$, $b_i = d(1 - p_i)$ so that $q_i$ has mean $p_i$ and

\[ \text{var}(q_i) = \frac{p_i(1 - p_i)}{d + 1}. \]

An obvious choice of mean model is the linear logistic model

\[ p_i = \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)}. \]

Notice that the limit $d \to \infty$, in which the variance of $q_i$ vanishes, corresponds to the
binomial model. Integration over the random
effects results in the beta-binomial marginal model:


\[ \Pr(Y_i = y_i) = \binom{N_i}{y_i} \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)} \frac{\Gamma(a_i + y_i)\Gamma(b_i + N_i - y_i)}{\Gamma(a_i + b_i + N_i)}. \]

The marginal moments are

\begin{align*}
E[Y_i] &= N_i p_i = N_i \frac{a_i}{a_i + b_i} \\
\text{var}(Y_i) &= N_i p_i(1 - p_i) \frac{a_i + b_i + N_i}{a_i + b_i + 1},
\end{align*}

confirming that there is no overdispersion when $N_i = 1$. This variance is also
equal to (7.10), with the assumption of constant $\tau_i^2$, on recognizing that $\tau^2 =
(a_i + b_i + 1)^{-1} = 1/(d + 1)$. Unfortunately, the log-likelihood $l(\beta, d)$ is not easy to
deal with due to the gamma functions. More seriously, the beta-binomial distribution
is not of exponential family form and does not possess the consistency properties of
distributions within this family.
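
Although the gamma functions make analytic work awkward, the beta-binomial probabilities are simple to evaluate on the log scale. The following sketch, with arbitrary illustrative values $N_i = 20$, $p_i = 0.3$, and $d = 4$, evaluates the probability mass function using log-gamma functions and confirms the marginal mean and variance by direct summation. The function name is hypothetical.

```python
from math import comb, exp, lgamma, log


def beta_binomial_pmf(y, n, a, b):
    """Pr(Y = y) for the beta-binomial, evaluated on the log scale
    to avoid overflow in the gamma functions."""
    log_pmf = (
        log(comb(n, y))
        + lgamma(a + b) - lgamma(a) - lgamma(b)
        + lgamma(a + y) + lgamma(b + n - y) - lgamma(a + b + n)
    )
    return exp(log_pmf)


n, p, d = 20, 0.3, 4.0           # illustrative values only
a, b = d * p, d * (1 - p)        # parameterization a = d p, b = d (1 - p)

pmf = [beta_binomial_pmf(y, n, a, b) for y in range(n + 1)]
total = sum(pmf)
mean = sum(y * w for y, w in enumerate(pmf))
variance = sum((y - mean) ** 2 * w for y, w in enumerate(pmf))

variance_formula = n * p * (1 - p) * (a + b + n) / (a + b + 1)
print(total, mean, variance, variance_formula)
```

With these values the binomial variance would be $20 \times 0.3 \times 0.7 = 4.2$, so the inflation factor $(d + N_i)/(d + 1) = 4.8$ is substantial.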

Liang and McCullagh (1993) discuss the modeling of overdispersed binary data.
In particular, they suggest plotting residuals

\[ \frac{y_i - N_i \hat{p}_i}{\sqrt{N_i \hat{p}_i(1 - \hat{p}_i)}} \]

against $N_i$ in order to see whether there is any association, which may help to choose
between models (7.9) and (7.10).


7.6  Logistic  Regression  Models 


7.6.1  Parameter  Interpretation


We write the probability of $Y = 1$ as $p(x)$ to emphasize the dependence on
covariates $x$. Model (7.7) is equivalent to saying that the odds of a positive outcome
may be modeled in a multiplicative fashion, that is,

\[ \frac{p(x)}{1 - p(x)} = \exp(x\beta) = \exp(\beta_0) \prod_{j=1}^{k} \exp(x_j \beta_j). \]

Less intuition is evident on the probability scale for which

\[ p(x) = \frac{\exp(x\beta)}{1 + \exp(x\beta)}. \]

The transformation used here is known as the expit transform (and is the inverse of
the logit transform). The expression for the probability makes it clear that we have
enforced $0 < p(x) < 1$.

For clarity, we discuss interpretation in the situation in which $p(x)$ is the
probability of a disease, given exposure $x$. Consider first the logistic regression
model in the case where the exposures have no effect on the probability of disease:

\[ \log\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0. \]

In this case, $\beta_0$ is the log odds of disease for all levels of the exposures $x$. Equivalent
statements are that $\exp(\beta_0)$ is the odds of disease and $\exp(\beta_0)/[1 + \exp(\beta_0)]$ is the
probability of disease, regardless of the levels of $x$.

Now consider the situation of a single exposure $x$ for an individual with
probability $p(x)$ and

\[ \log\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0 + \beta_1 x. \]

The parameter $\exp(\beta_0)$ is the odds of disease at exposure $x = 0$, that is, the odds for
an unexposed individual. The parameter $\exp(\beta_1)$ is the odds ratio for a unit increase
in $x$. For example, if $\exp(\beta_1) = 2$, the odds of disease double for a unit increase in
exposure. If $x$ is a binary exposure, coded as 0/1, then $\exp(\beta_1)$ is the ratio of odds
when going from unexposed to exposed:

\[ \frac{p(1)/[1 - p(1)]}{p(0)/[1 - p(0)]} = \frac{\exp(\beta_0 + \beta_1)}{\exp(\beta_0)} = \exp(\beta_1). \]

For a rare disease, the odds ratio and relative risk, which is given by $p(x)/p(x - 1)$
for a univariate exposure, are approximately equal, with the relative risks being
easier to interpret (see Sect. 7.10.2 for a more detailed discussion).
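
A small numerical example may help to fix the distinction between the odds ratio and the relative risk. The sketch below takes a binary exposure with $\exp(\beta_1) = 2$ and compares a rare disease (baseline odds of 0.001) with a common one (baseline odds of 1); both baseline values are arbitrary choices made for illustration.

```python
from math import exp, log


def expit(eta):
    """Inverse of the logit transform."""
    return 1 / (1 + exp(-eta))


def odds_ratio_and_relative_risk(beta0, beta1):
    """Compare exposed (x = 1) with unexposed (x = 0) individuals."""
    p0 = expit(beta0)
    p1 = expit(beta0 + beta1)
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    relative_risk = p1 / p0
    return odds_ratio, relative_risk


beta1 = log(2)  # odds double when going from unexposed to exposed
rare = odds_ratio_and_relative_risk(log(0.001), beta1)   # baseline odds 0.001
common = odds_ratio_and_relative_risk(log(1.0), beta1)   # baseline odds 1
print(rare)    # odds ratio 2, relative risk close to 2
print(common)  # odds ratio 2, relative risk 4/3
```

The odds ratio equals 2 in both cases, but only for the rare disease is the relative risk close to the odds ratio.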

Logistic  regression  models  may  be  defined  for  multiple  factors  and  continuous 
variables  in  an  exactly  analogous  fashion  to  the  multiple  linear  models  considered  in 
Chap.  5.  We  simply  include  on  the  right-hand  side  of  (7.6)  the  relevant  design  matrix 
and  associated  parameters.  This  is  a  benefit  of  the  GLM  framework  in  which  we 
have  linearity  on  some  scale,  though,  with  noncanonical  link  functions,  parameter 
interpretation  is  usually  more  difficult. 

The logistic model may be derived in terms of the so-called tolerance distributions.
Let $U(x)$ denote an underlying continuous measure of the disease state at
exposure $x$. We observe a binary version, $Y(x)$, of this variable which is related to
$U(x)$ via

\[ Y(x) = \begin{cases} 0 & \text{if } U(x) \le c \\ 1 & \text{if } U(x) > c, \end{cases} \]

for some threshold $c$. Suppose that the continuous measure follows a logistic
distribution: $U(x) \sim \text{logistic}[\mu(x), 1]$. This distribution is given by

\[ p(u \mid \mu, \sigma) = \frac{\exp\{(u - \mu)/\sigma\}}{\sigma\{1 + \exp[(u - \mu)/\sigma]\}^2}, \qquad -\infty < u < \infty. \]

The logistic distribution function, for the case $\sigma = 1$, is

\[ \Pr[U(x) < u] = \frac{\exp(u - \mu)}{1 + \exp(u - \mu)}, \qquad -\infty < u < \infty. \]

From this model for $U(x)$, we can obtain the probability of the discrete outcome as

\[ p(x) = \Pr[Y(x) = 1] = \Pr[U(x) > c] = \frac{\exp[\mu(x) - c]}{1 + \exp[\mu(x) - c]}, \]

which is equivalent to

\[ \log\left[ \frac{p(x)}{1 - p(x)} \right] = \mu(x) - c. \]

So far we have not specified how the exposure $x$ changes the distribution of the
continuous latent variable $U(x)$. We assume that the effect of exposure to $x$ is to
move the location of the underlying variable $U(x)$ in a linear fashion via $\mu(x) =
a + bx$, while keeping the variance constant. We then obtain

\[ \log\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0 + \beta_1 x, \]

where $\beta_0 = a - c$ and $\beta_1 = b$, that is, a linear logistic regression model.

The probit and complementary log-log links may similarly be derived from
normal and extreme-value (see footnote 1) tolerance distributions, respectively.


7.6.2  Likelihood  Inference  for  Logistic  Regression  Models 


We consider the logistic regression model

\[ \log\left[ \frac{p_i(\beta)}{1 - p_i(\beta)} \right] = x_i \beta, \]

where $x_i$ is a $1 \times (k + 1)$ vector of covariates measured on the $i$th individual and $\beta$
is the $(k + 1) \times 1$ vector of associated parameters. We write $p_i(\beta)$ to emphasize that
the probability of a positive response is a function of $\beta$. For the general binomial
model the log-likelihood is

\[ l(\beta) = \sum_{i=1}^{n} Y_i \log p_i(\beta) + \sum_{i=1}^{n} (N_i - Y_i) \log[1 - p_i(\beta)], \]

with score function

\[ S(\beta) = \sum_{i=1}^{n} \frac{\partial p_i(\beta)}{\partial \beta} \frac{[Y_i - N_i p_i(\beta)]}{p_i(\beta)[1 - p_i(\beta)]}. \qquad (7.12) \]

Letting $\mu$ represent the $n \times 1$ vector with $i$th element $\mu_i = N_i p_i(\beta)$ allows (7.12)
to be rewritten as

\[ S(\beta) = D^T V^{-1}[Y - \mu(\beta)], \qquad (7.13) \]

where $D$ is the $n \times (k + 1)$ matrix with $(i, j)$th element $\partial \mu_i / \partial \beta_j$, $i = 1, \dots, n$,
$j = 0, \dots, k$, and $V$ is the $n \times n$ diagonal matrix with $i$th diagonal element
$N_i p(x_i)[1 - p(x_i)]$. From Sect. 6.5.1,


\[ I_n(\beta)^{1/2}(\hat{\beta}_n - \beta) \to_d N_{k+1}(0, I_{k+1}), \]

where $I_n(\beta) = D^T V^{-1} D$. For the logistic model,

\begin{align*}
D_{ij} &= \frac{\partial \mu_i}{\partial \beta_j} = x_{ij} N_i p_i(1 - p_i) \\
V_{ii} &= N_i p_i(1 - p_i).
\end{align*}

Consequently, the score takes a particularly simple form:

\[ S(\beta) = x^T[Y - \mu(\beta)]. \]

Hence, at the maximum, $x^T Y = x^T \mu(\hat{\beta})$ so that selected sums of the outcomes (as
defined by the design matrix) are preserved. In addition, element $(j, j')$ of $I_n(\beta)$
takes the form

\[ \sum_{i=1}^{n} x_{ij} x_{ij'} N_i p_i(1 - p_i). \]

Footnote 1: $u$ has an extreme-value distribution if its distribution function is of the form $F(u) = 1 -
\exp\{-\exp[(u - \mu)/\sigma]\}$.
We now turn to hypothesis testing and consider a model with $0 < q < k$
parameters and fitted probabilities $\hat{p}$. The log-likelihood is

\[ l(\hat{p}) = \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (N_i - y_i) \log(1 - \hat{p}_i) \right], \]

with the maximum attainable value occurring at $\tilde{p}_i = y_i / N_i$. With fitted values
$\hat{y}_i = N_i \hat{p}_i$, the deviance is

\begin{align*}
D &= 2\left[ l(\tilde{p}) - l(\hat{p}) \right] \\
&= 2 \sum_{i=1}^{n} \left[ y_i \log\left( \frac{y_i}{\hat{y}_i} \right) + (N_i - y_i) \log\left( \frac{N_i - y_i}{N_i - \hat{y}_i} \right) \right], \qquad (7.14)
\end{align*}

where $\tilde{p}$ is the vector of probabilities $\tilde{p}_i$, $i = 1, \dots, n$. Notice that the deviance
will be small when $\hat{y}_i$ is close to $y_i$. The above form may also be derived directly
from (6.22) under a binomial model. If $n$, the number of parameters in the saturated
model (which, recall, is the number of conditions considered and not the total
number of trials which is given by $N$), is fixed, then under the hypothesized model
that produced $\hat{p}$, $D \to_d \chi^2_{n-q}$. The important emphasis here is on fixed $n$. The
outcome after head injury dataset provides an example in which this assumption is
valid since there are $n = 2 \times 2 \times 2 \times 3 = 24$ binomial trials being carried out at
each combination of the levels of coma score, pupils, hematoma, and age.
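
As an illustration of the deviance (7.14), the sketch below evaluates it for the aircraft fastener data of Table 7.1 under the simplest possible model, a common failure probability across all loads, for which the maximum likelihood estimate is the pooled proportion. This intercept-only model and the function name are chosen here only to keep the example short; the convention that terms with $y_i = 0$ or $y_i = N_i$ contribute zero is handled explicitly.

```python
from math import log

# Aircraft fastener data from Table 7.1.
failures = [10, 17, 30, 21, 18, 43, 54, 33, 60, 51]
sizes = [50, 70, 100, 60, 40, 85, 90, 50, 80, 65]


def binomial_deviance(y, n, p_hat):
    """Deviance (7.14) for binomial counts y out of n with fitted
    probabilities p_hat; terms with y = 0 or y = n contribute zero."""
    total = 0.0
    for yi, ni, pi in zip(y, n, p_hat):
        fitted = ni * pi
        if yi > 0:
            total += yi * log(yi / fitted)
        if yi < ni:
            total += (ni - yi) * log((ni - yi) / (ni - fitted))
    return 2 * total


# Intercept-only model: the MLE of the common probability is the pooled proportion.
p_common = sum(failures) / sum(sizes)
deviance_null = binomial_deviance(failures, sizes, [p_common] * len(sizes))

# Saturated model: fitted probabilities equal the observed proportions.
p_saturated = [y / n for y, n in zip(failures, sizes)]
deviance_saturated = binomial_deviance(failures, sizes, p_saturated)

print(p_common, deviance_null, deviance_saturated)
```

Here $n = 10$ loads are fixed and $q = 1$, so the reference distribution has 9 degrees of freedom, whose 0.95 quantile is approximately 16.92. The deviance of the common-probability model is far larger than this, in line with the clear increase in the proportion failing with load.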

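When $n$ is not fixed, the above result on the absolute fit is not relevant, but the
relative fit may be assessed by comparing the difference in deviances. Specifically,
consider nested models with $q_j$ parameters under $H_j$, $j = 0, 1$. Further, the
estimated probabilities and fitted values under hypothesis $H_j$ will be denoted $\hat{p}_j$
and $\hat{y}^{(j)}$, $j = 0, 1$, respectively. Then the reduction in deviance is

\begin{align*}
D_0 - D_1 &= 2\left\{ l(\tilde{p}) - l(\hat{p}_0) - \left[ l(\tilde{p}) - l(\hat{p}_1) \right] \right\} \\
&= 2\left[ l(\hat{p}_1) - l(\hat{p}_0) \right] \\
&= 2 \sum_{i=1}^{n} \left[ y_i \log\left( \frac{\hat{y}_i^{(1)}}{\hat{y}_i^{(0)}} \right) + (N_i - y_i) \log\left( \frac{N_i - \hat{y}_i^{(1)}}{N_i - \hat{y}_i^{(0)}} \right) \right].
\end{align*}

Under $H_0$, $D_0 - D_1 \to_d \chi^2_{q_1 - q_0}$.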

When the denominators $N_i$ are small, the deviance should not be used, as we
now illustrate in the case of $N_i = 1$. Suppose that $Y_i \mid p_i \sim_{ind} \text{Bernoulli}(p_i)$, with a
logistic model, $\text{logit}(p_i) = x_i \beta$, for $i = 1, \dots, n$. We fit this model using maximum
likelihood, resulting in estimates $\hat{\beta}$ and fitted probabilities $\hat{p}$. In this case, (7.14)
becomes

\begin{align*}
D &= -2 \sum_{i=1}^{n} y_i \log\left( \frac{\hat{p}_i}{1 - \hat{p}_i} \right) - 2 \sum_{i=1}^{n} \log(1 - \hat{p}_i) \\
&= -2 y^T x \hat{\beta} - 2 \sum_{i=1}^{n} \log(1 - \hat{p}_i) \\
&= -2 \hat{\beta}^T x^T y - 2 \sum_{i=1}^{n} \log(1 - \hat{p}_i)
\end{align*}

since $y \log y = (1 - y) \log(1 - y) = 0$. At the MLE, $x^T y = x^T \hat{p}$ so that

\[ D = -2 \hat{\beta}^T x^T \hat{p} - 2 \sum_{i=1}^{n} \log(1 - \hat{p}_i) \]

and the deviance is a function only of $\hat{\beta}$. In other words, $D$ is a deterministic
function of $\hat{\beta}$ only and cannot be used as a goodness of fit statistic. With small $N_i$,
this is a problem for any link function.

An alternative goodness of fit measure for a model with $q$ parameters is the
Pearson statistic, as introduced in Sect. 6.5.3:

\[ X^2 = \sum_{i=1}^{n} \frac{(Y_i - N_i \hat{p}_i)^2}{N_i \hat{p}_i(1 - \hat{p}_i)}, \qquad (7.15) \]

with $X^2 \to_d \chi^2_{n-q}$ under the null and under the assumption of fixed $n$. The Pearson
statistic also has problems with small $N_i$. For example, for the model $Y_i \mid p \sim
\text{Bernoulli}(p)$, $\hat{p} = \bar{y}$ and

\[ X^2 = \sum_{i=1}^{n} \frac{(y_i - \bar{y})^2}{\bar{y}(1 - \bar{y})} = n \]

which is not a useful goodness of fit measure (McCullagh and Nelder 1989,
Sect. 4.4.5). The deviance also has problems under this Bernoulli model (Exercise
7.5).
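
The degeneracy of the Pearson statistic for Bernoulli data is easily seen numerically. The sketch below evaluates $X^2$ under the common-probability model for three made-up binary samples of size 10 with quite different proportions of ones; the data are invented for illustration, and in each case the statistic equals the sample size.

```python
def pearson_statistic_bernoulli(y):
    """Pearson statistic for binary outcomes under a common probability,
    estimated by the sample mean."""
    n = len(y)
    y_bar = sum(y) / n
    return sum((yi - y_bar) ** 2 for yi in y) / (y_bar * (1 - y_bar))


# Invented binary samples of size 10 (illustration only).
datasets = [
    [1, 0, 0, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
]
values = [pearson_statistic_bernoulli(y) for y in datasets]
print(values)  # each value equals n = 10 up to rounding, whatever the data
```

The reason is that $\sum_i (y_i - \bar{y})^2 = n\bar{y}(1 - \bar{y})$ whenever each $y_i$ is 0 or 1, so the statistic carries no information about fit.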


7.6.3  Quasi-likelihood  Inference  for  Logistic  Regression  Models

As we saw in Sect. 6.6, an extremely simple and appealing manner of dealing with
overdispersion is to assume the model

\begin{align*}
E[Y_i \mid \beta] &= N_i p_i(\beta) \\
\text{var}(Y_i \mid \beta) &= \alpha N_i p_i(\beta)[1 - p_i(\beta)],
\end{align*}

with $\text{cov}(Y_i, Y_j \mid \beta) = 0$, for $i \ne j$. Under this model, due to the proportionality
of the variance model, the maximum quasi-likelihood estimator solves the estimating
equation obtained by setting the score function (7.12) to zero, since the value of $\alpha$ is
irrelevant to finding the root of the
estimating equation. Hence, the quasi-likelihood estimator $\hat{\beta}$ corresponds to the
MLE. Interval estimates and tests are altered, however. In particular, asymptotic
confidence intervals are derived from the variance-covariance $\alpha(D^T V^{-1} D)^{-1}$. An
obvious estimator of $\alpha$ is provided by the method of moments, which corresponds
to the Pearson statistic (7.15) divided by $n - k - 1$. This estimator is consistent if
the first two moments are correctly specified. The reference $\chi^2$ distribution under
the null is also perturbed, as in (6.27).


7.6.4  Bayesian  Inference  for  Logistic  Regression  Models 

A Bayesian approach to inference combines the likelihood $L(\beta)$ with a prior
$\pi(\beta)$, with a multivariate normal distribution being the obvious choice. For the
binomial model there is no conjugate distribution for general regression models.
In  simple  situations  with  a  small  number  of  discrete  covariates,  one  could  specify 
beta  priors  with  known  parameters  for  each  combination  of  levels  and  obtain 
analytic  posteriors,  but  there  would  be  no  linkage  between  the  different  groups,  that 
is,  no  transfer  of  information.  With  multivariate  normal  priors,  computation  may 
be  carried  out  using  INLA  (Sect.  3.7.4),  though  this  approximation  strategy  may  be 
inaccurate  if  the  binomial  denominators  are  small  (Fong  et  al.  2010).  An  alternative 
is  provided  by  MCMC  (Sect.  3.8). 

As  discussed  in  Sect.  7.6.3,  it  is  common  to  encounter  excess-binomial  variation. 
This  may  be  dealt  with  in  a  Bayesian  context  via  the  introduction  of  random  effects. 
The  beta-binomial  described  in  Sect.  7.5  provides  one  possibility.  An  alternative, 
more  flexible  formulation  would  assume  the  two-stage  model: 
Stage One: The likelihood:

$$Y_i \mid \beta, b_i \sim_{ind} \text{Binomial}\left[N_i, p(x_i)\right], \qquad \log\left[\frac{p(x_i)}{1 - p(x_i)}\right] = x_i \beta + b_i.$$

Stage Two: The random effects distribution:

$$b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2).$$

The parameter $\sigma_0^2$ controls the amount of overdispersion, though not in a simple
fashion. A Bayesian approach adds priors on $\beta$ and $\sigma_0^2$. This model is discussed
further in Sect. 9.13.
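
To see how the random effects variance induces excess-binomial variation, the two-stage model can be simulated. The following is a minimal sketch under assumed values (a common denominator of 20, an intercept-only linear predictor with $\beta = 0$, and $\sigma_0 = 1$); none of these values come from the text, and the function names are illustrative.

```python
import math
import random
import statistics


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def simulate_counts(n_units, denom, beta0, sigma0, rng):
    """Draw Y_i ~ Binomial(denom, expit(beta0 + b_i)) with b_i ~ N(0, sigma0^2)."""
    counts = []
    for _ in range(n_units):
        b = rng.gauss(0.0, sigma0)          # Stage Two: random effect
        p = expit(beta0 + b)                # Stage One: unit-specific probability
        counts.append(sum(rng.random() < p for _ in range(denom)))
    return counts


def variance_ratio(counts, denom):
    """Empirical variance divided by the binomial variance at the pooled mean."""
    p_bar = statistics.mean(counts) / denom
    return statistics.variance(counts) / (denom * p_bar * (1.0 - p_bar))


rng = random.Random(1)
denom = 20  # assumed common denominator
ratio_re = variance_ratio(simulate_counts(2000, denom, 0.0, 1.0, rng), denom)
ratio_bin = variance_ratio(simulate_counts(2000, denom, 0.0, 0.0, rng), denom)
print(f"sigma0 = 1: variance ratio = {ratio_re:.2f}")
print(f"sigma0 = 0: variance ratio = {ratio_bin:.2f}")
```

With $\sigma_0 = 0$ the ratio should be close to one, while a positive $\sigma_0$ inflates it by an amount that depends on both the denominator and the mean, which is one way to read the remark that the dependence is "not in a simple fashion."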


Example:  Outcome  After  Head  Injury 

Parameter  estimation,  whether  via  likelihood  or  Bayes,  is  straightforward  for  these 
data  given  a  particular  model.  The  difficult  task  in  this  problem  is  deciding  upon  a 
model.  If  prediction  is  all  that  is  required,  then  Bayesian  model  averaging  provides 
one  possibility,  and  this  is  explored  for  these  data  in  Chap.  12. 

In exploratory mode, we illustrate some approaches to model selection. In
Sect. 4.8, approaches to variable selection were reviewed and critiqued. In particular,
the hierarchy principle, in which all interactions are accompanied by their
constituent main effects, was discussed. Even applying the hierarchy principle here,
there are still 167 models with k = 4 variables.

We begin by applying forward selection (obeying the hierarchy principle),
beginning with the null model and using AIC as the selection criterion. This leads
to a model with all main effects and the three two-way interactions H.P, H.A,
and P.A. Since there are n = 24 fixed cells here we can assess the overall fit.
The deviance associated with the model selected via forward selection is 13.6 on
13 degrees of freedom, which indicates a good fit. Applying backward elimination
produces a model with all main effects and five two-way interactions, the three
selected using forward selection and, in addition, H.C and C.A. This model has a
deviance of 7.0 on 10 degrees of freedom, so the overall fit is good.
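
The degrees of freedom quoted above follow from counting parameters against the 24 cells, and the deviances can be referred to a $\chi^2$ distribution as a rough check of absolute fit. The sketch below assumes that H, P, and C are binary and that age has three levels (as the single coefficients and the A2, A3 terms in model (7.16) suggest), and uses a hand-written $\chi^2$ tail function, `chi2_sf`, so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df.

    Uses the recurrence Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1),
    started from Q(x; 2) = exp(-x/2) or Q(x; 1) = erfc(sqrt(x/2)).
    """
    h = x / 2.0
    if df % 2 == 0:
        k, q = 2, math.exp(-h)
    else:
        k, q = 1, math.erfc(math.sqrt(h))
    while k < df:
        q += h ** (k / 2) * math.exp(-h) / math.gamma(k / 2 + 1)
        k += 2
    return q


n_cells = 24                                   # 2 (H) x 2 (P) x 2 (C) x 3 (A), assumed layout
main_effects = 1 + 1 + 1 + 2                   # H, P, C, and two age contrasts
forward_terms = 1 + main_effects + 1 + 2 + 2   # intercept, mains, H.P, H.A, P.A
backward_terms = forward_terms + 1 + 2         # adds H.C and C.A

df_forward = n_cells - forward_terms
df_backward = n_cells - backward_terms
p_forward = chi2_sf(13.6, df_forward)
p_backward = chi2_sf(7.0, df_backward)
print(df_forward, round(p_forward, 2))
print(df_backward, round(p_backward, 2))
```

Neither tail probability is small, which is consistent with the statement that both selected models fit well overall.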

Carrying out an exhaustive search over all 167 models using AIC as the criterion
leads to the model selected with backward elimination (i.e., main effects plus five
two-way interactions). Using BIC as the criterion leads to a far simpler model with the
main effects H, C, and A only. It is often found that BIC picks simpler models.

We  consider  inference  for  the  model: 


1 + H + P + C + A2 + A3 + H.P + H.A2 + H.A3 + P.A2 + P.A3,     (7.16)
Table 7.2 Likelihood and Bayesian estimates and uncertainty measures for model (7.16) applied to the head injury data

          MLE      Std. err.   Post. mean   Post. S.D.
  1      -1.39     0.26        -1.37        0.26
  H       1.03     0.35         1.02        0.35
  P       2.05     0.30         2.04        0.29
  C      -1.52     0.17        -1.53        0.17
  A2      1.20     0.33         1.18        0.32
  A3      3.69     0.48         3.68        0.47
  H.P    -0.55     0.34        -0.55        0.34
  H.A2   -0.39     0.36        -0.38        0.36
  H.A3   -1.32     0.53        -1.29        0.52
  P.A2   -0.57     0.37        -0.56        0.36
  P.A3   -1.35     0.49        -1.33        0.48

that is, the model with main effects for hematoma (H), pupils (P), coma score
(C), and age (with A2 and A3 representing the second and third levels) and with
interactions between hematoma and pupils (H.P), hematoma and age (H.A2 and
H.A3), and pupils and age (P.A2 and P.A3).

The  MLEs  and  standard  errors  are  given  in  Table  7.2,  along  with  Bayesian 
posterior  means  and  standard  deviations.  The  prior  on  the  intercept  was  taken  as 
flat, and for the 10 log odds ratios, independent normal priors $N(0, 4.70^2)$ were
taken, which correspond to 95% intervals for the odds ratios of [0.0001, 10000],
that  is,  very  weak  prior  information  was  incorporated.  The  INLA  method  was  used 
for  computation.  The  original  scale  of  the  parameters  is  given  in  the  table,  which 
is  not  ideal  for  interpretation,  but  makes  sense  for  comparison  of  results  since  the 
sampling  distributions  and  posteriors  are  close  to  normal.  The  first  thing  to  note  is 
that  inference  from  the  two  approaches  is  virtually  identical.  This  is  not  surprising, 
given  the  relatively  large  counts  and  weak  priors. 
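
The prior standard deviation of 4.70 can be recovered from the stated 95% range for the odds ratio. The following sketch (the function name is illustrative, not from the text) assumes a zero-centered normal prior on the log odds ratio, so that the interval is symmetric on the log scale.

```python
import math
from statistics import NormalDist


def normal_sd_from_or_interval(lower, upper, level=0.95):
    """Standard deviation of a normal prior on the log odds ratio whose central
    interval of the given level maps to [lower, upper] on the odds ratio scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # 1.96 for a 95% interval
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


sd = normal_sd_from_or_interval(1e-4, 1e4)
print(round(sd, 2))
```

The same function can be used to translate any other prior range for an odds ratio into a normal standard deviation on the log scale.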

The pupil and age variables and their interaction at the highest age level are
clearly very important. The high coma score parameter is large and negative, and
since the coma variable is not involved in any interactions, we can say that having a
high coma score multiplies the odds of death by $\exp(-1.52) = 0.22$, a reduction of 78%.

The  observed  and  fitted  probabilities  are  displayed  in  Fig.  7.3  with  different 
line  types  joining  the  observed  probabilities  (as  in  Fig.  7.1).  The  vertical  lines  join 
the  fitted  to  the  observed  probabilities,  with  the  same  line  type  as  the  observed 
probabilities  with  which  they  are  associated.  There  are  no  clear  badly  fitting  cells. 


Fig. 7.3 Probability of death after head injury as a function of age, hematoma score, and pupils. Panels (a) and (b) are for low and high coma scores, respectively. The open circles are the fitted values. The observed values are joined by different line types. The residuals $y/n - \hat{p}$ are shown as vertical lines of the same line type


Example: Aircraft Fasteners

Let $Y_i$ be the number of fasteners failing at pressure $x_i$, and assume $Y_i \mid p_i \sim_{ind}$
Binomial$(n_i, p_i)$, $i = 1, \dots, n$, with the logistic model logit$(p_i) = \beta_0 + \beta_1 x_i$. This
specification yields likelihood

$$L(\beta) = \exp\left\{ \beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n x_i y_i - \sum_{i=1}^n n_i \log\left[1 + \exp(\beta_0 + \beta_1 x_i)\right] \right\}, \qquad (7.17)$$

where $\beta = [\beta_0, \beta_1]^{T}$. The MLEs and variance-covariance matrix are

$$\hat{\beta} = \begin{bmatrix} -5.34 \\ 0.0015 \end{bmatrix}, \qquad \widehat{\text{var}}(\hat{\beta}) = \begin{bmatrix} 2.98 \times 10^{-1} & -8.50 \times 10^{-5} \\ -8.50 \times 10^{-5} & 2.48 \times 10^{-8} \end{bmatrix}. \qquad (7.18)$$


The solid line in Fig. 7.4 is the fitted curve $\hat{p}(x)$ corresponding to the MLE. The
fit appears good. For comparison we also fit models with complementary log-log
and log-log link functions, as described in Sect. 7.4.2. Figure 7.4 shows the fit from
these models. The residual deviances from the logistic, complementary log-log, and
log-log links are 0.37, 0.69, and 1.7, respectively. These values are not comparable via
likelihood ratio tests since the models are not nested. AIC (Sect. 4.8.2) can be used
for such comparisons, but the approximations inherent in the derivation are more
accurate for nested models (Ripley 2004 and Sect. 10.6.4). The differences are so
small here that we would not draw any conclusions on the basis of these numbers.
Since the number of x categories is not fixed in this example, we cannot formally
examine the absolute fit of the models. In Fig. 7.6, we see that residual plots for
these three models indicate that the logistic fit is preferable.

Fig. 7.4 Fitted curves for the aircraft fasteners data under three different link functions

A 95% confidence interval for the odds ratio corresponding to a 500 psi increase
in pressure load is

$$\exp\left[ 500 \times \hat{\beta}_1 \pm 1.96 \times 500 \sqrt{\widehat{\text{var}}(\hat{\beta}_1)} \right] = [1.86, 2.53]. \qquad (7.19)$$


We now present a Bayesian analysis. For these abundant data and without any
available prior information, the improper uniform prior $\pi(\beta) \propto 1$ is assumed. The
posterior is therefore proportional to (7.17). We use a bivariate Metropolis-Hastings
random walk MCMC algorithm (Sect. 3.8.2) to explore the posterior. A bivariate
normal proposal was used, with variance-covariance matrix proportional to the
asymptotic variance-covariance matrix, $\widehat{\text{var}}(\hat{\beta})$, in (7.18). This matrix was multiplied
by four to give an acceptance rate of around 30%. Panels (a) and (b) of Fig. 7.5
show histograms of the dependent samples from the posterior, $\beta_0^{(s)}$ and $\beta_1^{(s)}$, $s =
1, \dots, S = 500$, and panel (c) the bivariate posterior. The posterior median for $\beta$ is
$[-5.36, 0.0015]$, and a 95% posterior interval for the odds ratio corresponding to a
500 psi increase in pressure is identical to the asymptotic likelihood interval (7.19).
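
A random walk Metropolis sampler of the kind described here can be sketched in a few lines. The grouped data below are assumed, illustrative values (the text does not list the fasteners data), the proposal covariance is four times the matrix printed in (7.18), and the chain length and burn-in are arbitrary choices; this is a sketch of the technique rather than a reproduction of the analysis.

```python
import math
import random

# Illustrative grouped data, assumed for this sketch (not given in the text).
x = [2500, 2700, 2900, 3100, 3300, 3500, 3700, 3900, 4100, 4300]
n = [50, 70, 100, 60, 40, 85, 90, 50, 80, 65]
y = [10, 17, 30, 21, 18, 43, 54, 33, 60, 51]


def log_post(b0, b1):
    """Log of (7.17): the binomial log-likelihood, i.e. the log posterior under a flat prior."""
    total = 0.0
    for xi, ni, yi in zip(x, n, y):
        eta = b0 + b1 * xi
        total += yi * eta - ni * math.log1p(math.exp(eta))
    return total


def random_walk_metropolis(start, cov, n_iter, rng):
    """Bivariate normal random walk Metropolis; returns the draws and acceptance rate."""
    # Cholesky factor of the 2 x 2 proposal covariance.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[0][1] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    b0, b1 = start
    lp = log_post(b0, b1)
    draws, accepted = [], 0
    for _ in range(n_iter):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        c0 = b0 + l11 * z1
        c1 = b1 + l21 * z1 + l22 * z2
        lp_new = log_post(c0, c1)
        # Symmetric proposal, so the acceptance probability is min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            b0, b1, lp = c0, c1, lp_new
            accepted += 1
        draws.append((b0, b1))
    return draws, accepted / n_iter


rng = random.Random(2024)
proposal = [[4 * 2.98e-1, 4 * -8.50e-5], [4 * -8.50e-5, 4 * 2.48e-8]]
draws, rate = random_walk_metropolis((-5.34, 0.0015), proposal, 5000, rng)
kept = draws[1000:]                      # discard an (arbitrary) burn-in
or_500 = sorted(math.exp(500 * b1) for _, b1 in kept)
lower = or_500[int(0.025 * len(or_500))]
upper = or_500[int(0.975 * len(or_500)) - 1]
print(f"acceptance rate = {rate:.2f}")
print(f"95% interval for the 500 psi odds ratio = [{lower:.2f}, {upper:.2f}]")
```

Because the samples are dependent, a longer chain than the S = 500 samples displayed in Fig. 7.5 is used here so that the tail quantiles are reasonably stable.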
Fig. 7.5 Posterior summaries for the aircraft fasteners data: (a) $p(\beta_0 \mid y)$, (b) $p(\beta_1 \mid y)$, (c)
$p(\beta_0, \beta_1 \mid y)$, (d) $p(\exp(\theta)/[1 + \exp(\theta)] \mid y)$, where $\theta = \beta_0 + \beta_1 x$, that is, the posterior for the
probability of failure at a load of x = 3,000 psi


We now imagine that it is of interest to give an interval estimate for the
probability of failure at x = 3,000 psi (which is indicated as a dashed vertical line
on Fig. 7.4). An asymptotic 95% confidence interval for $\theta = \beta_0 + \beta_1 x$ is

$$\hat{\theta} \pm 1.96 \times \sqrt{\widehat{\text{var}}(\hat{\theta})},$$

where

$$\hat{\theta} = \hat{\beta}_0 + x \hat{\beta}_1, \qquad \widehat{\text{var}}(\hat{\theta}) = \widehat{\text{var}}(\hat{\beta}_0) + 2x\,\widehat{\text{cov}}(\hat{\beta}_0, \hat{\beta}_1) + x^2\,\widehat{\text{var}}(\hat{\beta}_1).$$

Taking the expit transform of the endpoints of the confidence interval on the linear
predictor scale leads to a 95% interval of [0.29, 0.38]. Substitution of the posterior
samples $\beta^{(s)}$ to give expit$(\theta^{(s)})$, $s = 1, \dots, S$, results in a 95% interval which is
again identical to the frequentist interval.

Table 7.3 A generic 2 × 2 table

             Y = 0            Y = 1
  X = 0      $y_{00}$         $y_{01}$         $y_{0\cdot}$
  X = 1      $y_{10}$         $y_{11}$         $y_{1\cdot}$
             $y_{\cdot 0}$    $y_{\cdot 1}$    $y_{\cdot\cdot}$
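
Returning to the interval for the failure probability at 3,000 psi, the variance of the linear predictor can be computed directly from the entries of (7.18). In the sketch below the point estimate $\hat{\theta}$ is passed in as an argument because the coefficients printed in (7.18) are rounded; the value $-0.69$ used in the example call is an assumption (it is not stated in the text), chosen because it is consistent with the reported interval.

```python
import math

# Entries of the variance-covariance matrix (7.18).
var_b0, cov_b01, var_b1 = 2.98e-1, -8.50e-5, 2.48e-8
x0 = 3000.0  # pressure load of interest (psi)

# Variance of theta_hat = b0_hat + x0 * b1_hat.
var_theta = var_b0 + 2.0 * x0 * cov_b01 + x0 ** 2 * var_b1
half_width = 1.96 * math.sqrt(var_theta)


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def interval_for_probability(theta_hat):
    """Transform the Wald interval for theta to the probability scale."""
    return expit(theta_hat - half_width), expit(theta_hat + half_width)


lo, hi = interval_for_probability(-0.69)  # assumed value of theta_hat
print(round(var_theta, 4), round(lo, 2), round(hi, 2))
```

Note the heavy cancellation in `var_theta`: the strongly negative covariance between the intercept and slope estimates makes the linear predictor at 3,000 psi far more precisely estimated than the intercept alone.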


7.7  Conditional  Likelihood  Inference 

In Sect. 2.4.2, conditional likelihood was introduced as a procedure that could be
used for eliminating nuisance parameters. In this chapter, conditional likelihood will
be used for discrete data, which we denote $y$. Suppose the distribution for the data
can be represented as

$$p(y \mid \lambda, \phi) \propto p(t_1 \mid t_2, \lambda)\, p(t_2 \mid \lambda, \phi), \qquad (7.20)$$

where $\lambda$ is a parameter of interest and $\phi$ is a nuisance parameter. Then inference for
$\lambda$ may be based on the conditional likelihood

$$L_c(\lambda) = p(t_1 \mid t_2, \lambda).$$

Perhaps the most popular use of conditional likelihood leads to Fisher’s exact
test. Consider the 2 × 2 layout of data shown in Table 7.3 with

$$y_{01} \mid p_0 \sim \text{Binomial}(y_{0\cdot}, p_0), \qquad y_{11} \mid p_1 \sim \text{Binomial}(y_{1\cdot}, p_1),$$

which we combine with the logistic regression model:

$$\log\left(\frac{p_0}{1 - p_0}\right) = \beta_0, \qquad \log\left(\frac{p_1}{1 - p_1}\right) = \beta_0 + \beta_1.$$

Here,

$$\exp(\beta_1) = \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}$$

is the odds of a positive response in the X = 1 group, divided by the odds of
a positive response in the X = 0 group, that is, the odds ratio. This setup gives
likelihood

$$\Pr(y_{01}, y_{11} \mid \beta_0, \beta_1) = \binom{y_{0\cdot}}{y_{01}} \binom{y_{1\cdot}}{y_{11}} \frac{e^{y_{11}\beta_1}\, e^{y_{\cdot 1}\beta_0}}{(1 + e^{\beta_0 + \beta_1})^{y_{1\cdot}}\, (1 + e^{\beta_0})^{y_{0\cdot}}}. \qquad (7.21)$$


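Now the distribution of $[y_{01}, y_{11}]$ implies the distribution of $[y_{11}, y_{\cdot 1}]$, so we can write

$$\Pr(y_{11}, y_{\cdot 1} \mid \beta_0, \beta_1) = \binom{y_{0\cdot}}{y_{\cdot 1} - y_{11}} \binom{y_{1\cdot}}{y_{11}} \frac{e^{y_{11}\beta_1}\, e^{y_{\cdot 1}\beta_0}}{(1 + e^{\beta_0 + \beta_1})^{y_{1\cdot}}\, (1 + e^{\beta_0})^{y_{0\cdot}}}.$$

We now show that by conditioning on the column totals, in addition to the row totals,
we obtain a distribution that depends only on the parameter of interest $\beta_1$. Consider

$$\Pr(y_{11} \mid y_{\cdot 1}, \beta_0, \beta_1) = \frac{\Pr(y_{11}, y_{\cdot 1} \mid \beta_0, \beta_1)}{\Pr(y_{\cdot 1} \mid \beta_0, \beta_1)},$$

where the marginal distribution is obtained by summing over the possible values
that $y_{11}$ can take, that is,

$$\Pr(y_{\cdot 1} \mid \beta_0, \beta_1) = \sum_{u = u_0}^{u_1} \Pr(u, y_{\cdot 1} \mid \beta_0, \beta_1) = \sum_{u = u_0}^{u_1} \binom{y_{0\cdot}}{y_{\cdot 1} - u} \binom{y_{1\cdot}}{u} \frac{e^{u\beta_1}\, e^{y_{\cdot 1}\beta_0}}{(1 + e^{\beta_0 + \beta_1})^{y_{1\cdot}}\, (1 + e^{\beta_0})^{y_{0\cdot}}},$$

where $u_0 = \max(0, y_{\cdot 1} - y_{0\cdot})$ and $u_1 = \min(y_{1\cdot}, y_{\cdot 1})$ ensure that the marginals
are preserved. With respect to (7.20), $\lambda = \beta_1$, $\phi = \beta_0$, $t_1 = y_{11}$, and $t_2 = y_{\cdot 1}$.
Accordingly, the conditional distribution takes the form

$$\Pr(y_{11} \mid y_{\cdot 1}, \beta_1) = \frac{\binom{y_{0\cdot}}{y_{\cdot 1} - y_{11}} \binom{y_{1\cdot}}{y_{11}} e^{y_{11}\beta_1}}{\sum_{u = u_0}^{u_1} \binom{y_{0\cdot}}{y_{\cdot 1} - u} \binom{y_{1\cdot}}{u} e^{u\beta_1}}, \qquad (7.22)$$

an extended hypergeometric distribution. We have removed the conditioning on $\beta_0$
since this distribution depends on $\beta_1$ only (which was the point of this derivation).
Inference for $\beta_1$ may be based on the conditional likelihood (7.22). In particular, the
conditional MLE may be determined, though unfortunately no closed form exists.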

Conventionally, estimates of $\beta_0$ and $\beta_1$ would be determined from the product
of binomial likelihoods, (7.21). Unless the samples are small, the conditional and
unconditional MLEs (and associated variances) will be in close agreement, but
for small samples, the conditional MLE is preferred due to the following informal
argument. Consider the original 2 × 2 data in Table 7.3. If we knew $y_{\cdot 1}$, then this
alone would not help us to estimate $\beta_1$, but the precision of conclusions about $\beta_1$
will depend on this column total, and we should therefore condition on the observed
value. This is to ensure that we attach to the conclusions the precision actually
achieved and not that to be achieved hypothetically in a particular situation that
has in fact not occurred. For further discussion, see Cox and Snell (1989, p. 27-29).

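To derive the conditional MLE, first consider the conditional likelihood

$$L_c(\beta_1) = \frac{c(y_{11})\, e^{y_{11}\beta_1}}{\sum_{u = u_0}^{u_1} c(u)\, e^{u\beta_1}},$$

where

$$c(u) = \binom{y_{0\cdot}}{y_{\cdot 1} - u} \binom{y_{1\cdot}}{u}.$$

The (conditional) score is

$$S_c(\beta_1) = \frac{\partial}{\partial \beta_1} \log L_c(\beta_1) = y_{11} - \frac{\sum_{u = u_0}^{u_1} c(u)\, u\, e^{u\beta_1}}{\sum_{u = u_0}^{u_1} c(u)\, e^{u\beta_1}}. \qquad (7.23)$$

The extended hypergeometric distribution is a member of the exponential family
(Exercise 7.6) and so $E[S_c(\beta_1)] = 0$, with

$$S_c(\hat{\beta}_1) = \frac{\partial}{\partial \beta_1} \log L_c(\beta_1) \Big|_{\hat{\beta}_1} = 0$$

at the MLE. Consequently, from (7.23), we can use the equation $E[Y_{11} \mid \hat{\beta}_1] = y_{11}$
to solve for $\hat{\beta}_1$. Asymptotic inference is based on

$$I_c(\hat{\beta}_1)^{1/2} (\hat{\beta}_1 - \beta_1) \rightarrow_d N(0, 1), \qquad (7.24)$$

where the (conditional) information is

$$I_c(\beta_1) = -\frac{\partial^2}{\partial \beta_1^2} \log L_c(\beta_1) = \frac{\sum_{u = u_0}^{u_1} u^2 c(u)\, e^{u\beta_1}}{\sum_{u = u_0}^{u_1} c(u)\, e^{u\beta_1}} - \left( \frac{\sum_{u = u_0}^{u_1} u\, c(u)\, e^{u\beta_1}}{\sum_{u = u_0}^{u_1} c(u)\, e^{u\beta_1}} \right)^2 = \text{var}(Y_{11} \mid \beta_1).$$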

It is straightforward to test the null hypothesis $H_0 : \beta_1 = 0$ using the conditional
likelihood. When $\beta_1 = 0$, the distribution (7.22) is hypergeometric, and so

$$\Pr(y_{11} \mid y_{\cdot 1}, \beta_1 = 0) = \frac{\binom{y_{0\cdot}}{y_{\cdot 1} - y_{11}} \binom{y_{1\cdot}}{y_{11}}}{\binom{y_{\cdot\cdot}}{y_{\cdot 1}}}. \qquad (7.25)$$

Table 7.4 Data on tumor appearance within mice

                          Tumor
                      Absent    Present
                      Y = 0     Y = 1
  Control   X = 0      13        19        32
  Treated   X = 1       2        21        23
                       15        40        55


The comparison of the observed $y_{11}$ with the tail of this distribution is known as
Fisher’s exact test (Fisher 1935). Various possibilities are available to obtain a two-sided
significance level, the simplest being to double the one-sided p-value. An
alternative is to sum the probabilities of all tables that are no more probable than
the observed table. Confidence intervals for $\beta_1$ may be obtained from (7.24), or by
inverting the test. See Agresti (1990, Sects. 3.5 and 3.6) for further discussion; in
particular, the problems of the discreteness of the sampling distribution are discussed.


Example:  Tumor  Appearance  Within  Mice 

We  illustrate  the  application  of  conditional  likelihood  using  data  reported  by 
Essenberg  (1952)  and  presented  in  Table  7.4.  To  examine  the  carcinogenic  effects  of 
tobacco,  36  albino  mice  were  placed  in  an  enclosed  chamber  which  was  filled  with 
the  smoke  of  one  cigarette  every  12  h  per  day.  Another  group  of  mice  were  kept  in 
an  alternative  chamber  without  smoke.  After  1  year,  autopsies  were  carried  out  on 
those  mice  that  had  survived  for  at  least  the  first  2  months  of  the  experiment.  The 
data  in  Table  7.4  give  the  numbers  of  mice  with  and  without  tumors  in  the  “control” 
and  “treated”  groups. 

For these data, the permissible values of $y_{11}$ lie between $u_0 = \max(0, 40 -
32) = 8$ and $u_1 = \min(23, 40) = 23$. Under $H_0 : \beta_1 = 0$, the probabilities of
$y_{11} = 21, 22, 23$, from (7.25), are 0.00739, 0.00091, and 0.00005, which sum to
0.00834, the one-sided p-value. The simplest version of the two-sided p-value is
therefore 0.0167, which would lead to rejection of $H_0$ under the usual threshold of
0.05. Summing the probabilities of more extreme tables gives a p-value of 0.0130.
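
These p-values can be checked directly from the hypergeometric probabilities (7.25). The sketch below uses only the margins of Table 7.4; the variable and function names are illustrative, and "more extreme" is taken to mean tables whose null probability does not exceed that of the observed table.

```python
from math import comb


def hypergeom_pmf(u, row0, row1, col1):
    """Pr(y11 = u | margins) under beta1 = 0, as in (7.25)."""
    return comb(row0, col1 - u) * comb(row1, u) / comb(row0 + row1, col1)


# Margins of Table 7.4: control total, treated total, tumor-present total.
row0, row1, col1, observed = 32, 23, 40, 21
u_lo, u_hi = max(0, col1 - row0), min(row1, col1)
pmf = {u: hypergeom_pmf(u, row0, row1, col1) for u in range(u_lo, u_hi + 1)}

one_sided = sum(p for u, p in pmf.items() if u >= observed)
doubled = 2 * one_sided
# Tables no more probable than the observed one (small tolerance for ties).
p_obs = pmf[observed]
two_sided = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-9))

print(f"one-sided p = {one_sided:.5f}")
print(f"doubled     = {doubled:.4f}")
print(f"two-sided   = {two_sided:.4f}")
```

The probability-ordered two-sided value is smaller than the doubled one-sided value here because the null distribution is not symmetric about the observed count.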

Denoting by $\hat{\beta}_1^{u}$ the (unconditional) MLE of the log odds ratio, we have

$$\hat{\beta}_1^{u} = \log\left(\frac{y_{00}\, y_{11}}{y_{01}\, y_{10}}\right) = \log(7.18) = 1.97,$$

with asymptotic standard error

$$\sqrt{\frac{1}{y_{00}} + \frac{1}{y_{01}} + \frac{1}{y_{10}} + \frac{1}{y_{11}}} = \sqrt{\frac{1}{13} + \frac{1}{19} + \frac{1}{2} + \frac{1}{21}} = 0.82,$$

to give asymptotic 95% confidence interval for the odds ratio of

$$\exp(1.97 \pm 1.96 \times 0.82) = [1.44, 35.8].$$

The Wald test p-value of 0.0166 is very close to that obtained from Fisher’s exact
test. The conditional MLE is

$$\hat{\beta}_1 = \log(6.95) = 1.93$$

with conditional standard error 0.61, illustrating the extra precision gained by
conditioning on $y_{\cdot 1}$. The conditional asymptotic 95% confidence interval for the
odds ratio based on (7.24) is

$$\exp(1.93 \pm 1.96 \times 0.61) = [2.11, 22.9].$$
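
Since the conditional MLE has no closed form, it has to be found numerically. A minimal sketch follows, solving $E[Y_{11} \mid \beta_1] = y_{11}$ from (7.23) by bisection for the margins of Table 7.4; bisection, the search bracket, and the tolerance are choices made for this illustration rather than the method used in the text.

```python
import math
from math import comb

# Margins and observed cell of Table 7.4.
row0, row1, col1, y11 = 32, 23, 40, 21
u_lo, u_hi = max(0, col1 - row0), min(row1, col1)
support = range(u_lo, u_hi + 1)
coef = {u: comb(row0, col1 - u) * comb(row1, u) for u in support}  # c(u)


def conditional_mean(beta1):
    """E[Y11 | y.1, beta1] under the extended hypergeometric distribution (7.22)."""
    # Work on the log scale and subtract the largest exponent for stability.
    logw = {u: math.log(coef[u]) + u * beta1 for u in support}
    m = max(logw.values())
    w = {u: math.exp(lw - m) for u, lw in logw.items()}
    return sum(u * wu for u, wu in w.items()) / sum(w.values())


def conditional_mle(lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection on the score y11 - E[Y11 | beta1], which is decreasing in beta1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if conditional_mean(mid) < y11:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


beta1_cond = conditional_mle()
beta1_uncond = math.log((13 * 21) / (19 * 2))
print(f"conditional odds ratio   = {math.exp(beta1_cond):.2f}")
print(f"unconditional odds ratio = {math.exp(beta1_uncond):.2f}")
```

The conditional estimate lies slightly closer to the null than the unconditional one, which is the usual pattern in small tables.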


7.8  Assessment  of  Assumptions 

In  general,  residual  analysis  is  subjective,  and  though  one  might  be  able  to 
conclude  that  a  model  is  inadequate,  concluding  adequacy  is  much  more  difficult. 
Unfortunately,  for  logistic  regression  models  with  binary  data,  the  assessment  is 
even  more  tentative.  Even  when  the  model  is  true,  little  can  be  said  about  the 
moments  and  distribution  of  the  residuals. 

We briefly review Pearson and deviance residuals as defined for GLMs in
Sect. 6.9. Pearson residuals are defined as $e_i^\star = (Y_i - \hat{\mu}_i)/\sqrt{\widehat{\text{var}}(Y_i)}$, and for
$Y_i \mid p_i \sim$ Binomial$(n_i, p_i)$, we obtain

$$e_i^\star = \frac{y_i - n_i \hat{p}_i}{[n_i \hat{p}_i (1 - \hat{p}_i)]^{1/2}},$$

$i = 1, \dots, n$. Pearson’s statistic is

$$X^2 = \sum_{i=1}^n (e_i^\star)^2,$$

showing the link between the measures of local and absolute fit.

Deviance residuals are defined as

$$e_i^\star = \text{sign}(y_i - \hat{\mu}_i) \sqrt{d_i},$$

$i = 1, \dots, n$, where $d_i$ is the contribution of observation $i$ to the deviance. Note that
the deviance $D = \sum_{i=1}^n (e_i^\star)^2$ where $D$ is given by (7.14).
For binary $Y_i$ and a particular value of $\hat{p}_i$, the residuals can only take one of two
possible values, which is clearly a problem (this is illustrated later, in Fig. 7.8).
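
Both residuals are simple to compute from the counts and fitted probabilities. The sketch below assumes the standard binomial deviance contribution $d_i = 2[y_i \log(y_i/\hat{\mu}_i) + (n_i - y_i) \log\{(n_i - y_i)/(n_i - \hat{\mu}_i)\}]$ with $\hat{\mu}_i = n_i \hat{p}_i$ and the convention $0 \log 0 = 0$; the function names are illustrative.

```python
import math


def pearson_residual(y, n, p_hat):
    """(y - n p) / sqrt(n p (1 - p)) for a binomial observation."""
    return (y - n * p_hat) / math.sqrt(n * p_hat * (1.0 - p_hat))


def _xlogy(a, b):
    """a * log(a / b), with the convention that the term is 0 when a = 0."""
    return 0.0 if a == 0 else a * math.log(a / b)


def deviance_residual(y, n, p_hat):
    """sign(y - n p) * sqrt(d), with d the binomial deviance contribution."""
    d = 2.0 * (_xlogy(y, n * p_hat) + _xlogy(n - y, n * (1.0 - p_hat)))
    return math.copysign(math.sqrt(max(d, 0.0)), y - n * p_hat)


# Grouped example: 3 events out of 10 with fitted probability 0.5.
print(round(pearson_residual(3, 10, 0.5), 3), round(deviance_residual(3, 10, 0.5), 3))

# Binary example: for a fixed fitted probability only two residual values can occur.
binary_values = sorted(round(pearson_residual(y, 1, 0.3), 3) for y in (0, 1))
print(binary_values)
```

The binary case shows the two-valued behavior described above: with $\hat{p}_i = 0.3$ every observation yields a Pearson residual of either about $-0.65$ or about $1.53$.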

Few  analytical  results  are  available  for  the  case  of  a  binomial  model,  but,  if 
the  model  is  correct,  both  the  Pearson  and  deviance  residuals  are  asymptotically 
normally distributed. Hence, they may be put to many of the same uses as
residuals defined with respect to the normal linear regression model (as described
in Sect. 5.11.3). For example, residuals may be plotted against covariates x and
examined  for  outlying  values.  Interpretation  is  more  difficult,  however,  as  one  must 
examine  the  appropriateness  of  the  link  function  as  well  as  the  linearity  assumption. 
A  normal  QQ  plot  of  residuals  can  indicate  outlying  observations. 

Empirical logits $\log[(y_i + 0.5)/(N_i - y_i + 0.5)]$ are useful for examining the
adequacy of the logistic linear model. The addition of 0.5 removes problems when
$y_i = 0$ or $N_i$. This adjustment is optimal in the sense of removing the leading-order
bias of the empirical logit; see Cox and Snell (1989, Sect. 2.1.6)
for  details.  The  mean-variance  relationship  can  be  examined  by  plotting  residuals 
versus  fitted  values.  In  particular,  different  overdispersion  models  may  be  compared, 
as  discussed  in  Sect.  7.5. 
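
The empirical logit is a one-line computation; the sketch below (with an illustrative function name) shows that it remains finite at the boundary counts where the raw logit is undefined.

```python
import math


def empirical_logit(y, n):
    """log[(y + 0.5) / (n - y + 0.5)] for y events out of n trials."""
    return math.log((y + 0.5) / (n - y + 0.5))


for y in (0, 5, 10):
    print(y, round(empirical_logit(y, 10), 3))
```

Plotting these values against a covariate and looking for departures from a straight line is one way to examine the adequacy of the logistic linear model informally.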


Example:  Aircraft  Fasteners 

In  this  example,  the  denominators  are  relatively  large  (ranging  between  40  and 
100  for  each  of  the  10  trials),  and  so  the  residuals  are  informative.  Figure  7.6 
shows  Pearson  residuals  plotted  against  pressure  load  for  each  of  three  different  link 
functions.  On  the  basis  of  these  plots,  the  logistic  model  looks  the  most  reasonable 
since  there  are  runs  of  positive  and  negative  residuals  associated  with  the  other  two 
link  functions,  signifying  mean  model  misspecification. 


Example:  Outcome  After  Head  Injury 

The binary response in this example is cross-classified with respect to factors with
2 or 3 levels. We saw in Fig. 7.3 that the fit of model (7.16) appeared reasonable,
though the distances $y_i/n_i - \hat{p}_i$ that are displayed as vertical lines are not standardized,
making interpretation difficult. Figure 7.7 gives a normal QQ plot of the Pearson
residuals, and there are no obvious causes for concern with no outlying points.
Fig. 7.6 Pearson residuals versus pressure load for the aircraft fasteners data for (a) logistic link
model, (b) complementary log-log link model, and (c) log-log link model


Fig. 7.7 QQ plot of Pearson residuals for the head injury data (sample quantiles versus theoretical quantiles)

Example:  BPD  and  Birth  Weight 

We fit a logistic regression model

$$\Pr(Y = 1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}, \qquad (7.26)$$

with Y = 0/1 corresponding to absence/presence of BPD and x to birth weight. The
curve arising from fitting this model is shown in Fig. 7.2, along with the curve from
the use of the complementary log-log link. We might question whether either of
these curves is adequate, since they are relatively inflexible, with forms determined
by two parameters only. The Pearson residuals from the two models are plotted
versus birth weight in Fig. 7.8. The binary nature of the response is evident in these
plots, and assessing whether the models are adequate is not possible from these plots.
In Chap. 11, we return to these data and fit flexible nonlinear models.
Fig. 7.8 Pearson residuals versus birth weight (grams) for the BPD data: (a) logistic model, (b) complementary log-log model

7.9  Bias,  Variance,  and  Collapsibility 


We begin by summarizing some of the results of Sect. 5.9 in which the bias and
variance of estimators were examined for the linear model. Consider the models:

$$E[Y \mid x, z] = \beta_0 + \beta_1 x + \beta_2 z \qquad (7.27)$$

$$E[Y \mid x] = \beta_0^* + \beta_1^* x. \qquad (7.28)$$

First, suppose that x and z are orthogonal. Roughly speaking, if z is related to Y,
then fitting model (7.27) will lead to a reduction in the variance of $\hat{\beta}_1$, and $E[\hat{\beta}_1] =
E[\hat{\beta}_1^*]$ so that bias is not an issue. When x and z are not orthogonal, then fitting
model (7.28) will lead to bias in the estimation of $\beta_1$ since $\beta_1^*$ reflects not only x but
also the effect of z through its association with x.

In this section we discuss these issues with respect to logistic regression models.
To this end, consider the logistic models:

$$E[Y \mid x, z] = \frac{\exp(\beta_0 + \beta_1 x + \beta_2 z)}{1 + \exp(\beta_0 + \beta_1 x + \beta_2 z)} \qquad (7.29)$$

$$E[Y \mid x] = \frac{\exp(\beta_0^* + \beta_1^* x)}{1 + \exp(\beta_0^* + \beta_1^* x)} = E_{z \mid x}\left[ \frac{\exp(\beta_0 + \beta_1 x + \beta_2 z)}{1 + \exp(\beta_0 + \beta_1 x + \beta_2 z)} \right]. \qquad (7.30)$$

The last equation indicates that the effects of omitting z are very hard to
determine, due to the nonlinearity of the logistic function. As we illustrate
shortly though, even if x and z are orthogonal, $E[\hat{\beta}_1] \neq E[\hat{\beta}_1^*]$. Linear models for the
probabilities are more straightforward to understand, but, as discussed previously,
since probabilities are constrained to [0, 1], such models are rarely appropriate for
binary data.
Table 7.5 Illustration of Simpson’s paradox for the case of non-orthogonal x and z

                           z = 0              z = 1             Marginal
                       Y = 0   Y = 1      Y = 0   Y = 1      Y = 0   Y = 1
  Control    x = 0       8       2          9      21         17      23
  Treatment  x = 1      18      12          2       8         20      20
  Odds ratio                2.7                 1.7                0.7


We  now  discuss  the  marginalization  of  effect  measures.  Roughly  speaking,  if 
an  effect  measure  is  constant  across  strata  (subtables)  and  equal  to  the  measure 
calculated  from  the  marginal  table,  it  is  known  as  collapsible.  Non-collapsibility 
is  sometimes  referred  to  as  Simpson’s  paradox  (Simpson  1951)  in  the  statistics 
literature. As in Greenland et al. (1999), we include the case of orthogonal x and
z in Simpson’s paradox, though we first illustrate with a case in which x and z are
non-orthogonal.

Consider the data in Table 7.5 in which x = 0/1 represents a control/treatment
which is applied in two strata z = 0/1, with a binary response Y = 0/1 being
recorded. In both z strata, the odds of a positive response are higher under treatment,
with odds ratios of 2.7 and 1.7. However, when the data are collapsed over strata, the
marginal association is reversed to give an odds ratio of 0.7.

Mathematically, the paradox is relatively simple to understand. Let

$$p_{xz} = \Pr(Y = 1 \mid X = x, Z = z), \qquad p_x^* = \Pr(Y = 1 \mid X = x)$$

be the conditional and marginal probabilities of a response and $q_x = \Pr(Z = 1 \mid
X = x)$ summarize the relationship between x and z, for x, z = 0, 1. The “paradox”
reflects the fact that it is possible to have

$$p_{00} < p_{10} \quad \text{and} \quad p_{01} < p_{11},$$

that is, the probability of a positive response being greater under X = 1 for both
strata, but

$$p_{00}(1 - q_0) + p_{01} q_0 = p_0^* > p_1^* = p_{10}(1 - q_1) + p_{11} q_1$$

so that the marginal probability of a positive response is greater under x = 0 than
under x = 1. For the data of Table 7.5,

$$p_{00} = \frac{2}{10} = 0.20, \quad p_{10} = \frac{12}{30} = 0.40, \quad p_{01} = \frac{21}{30} = 0.70, \quad p_{11} = \frac{8}{10} = 0.80,$$

$$p_0^* = \frac{23}{40} = 0.58, \quad p_1^* = \frac{20}{40} = 0.50,$$

and 
Table 7.6 Illustration of Simpson’s paradox for the case of orthogonal x and z

                           z = 0              z = 1             Marginal
                       Y = 0   Y = 1      Y = 0   Y = 1      Y = 0   Y = 1
  Control    x = 0      95       5         10      90        105      95
  Treatment  x = 1      90      10          5      95         95     105
  Odds ratio                2.1                 2.1                1.2


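$$q_0 = \frac{30}{40} = 0.75, \qquad q_1 = \frac{10}{40} = 0.25,$$

with

$$p_0^* = 0.20 \times 0.25 + 0.70 \times 0.75 = 0.58 > 0.50 = 0.40 \times 0.75 + 0.80 \times 0.25 = p_1^*.$$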


It is important to realize that the paradox has nothing to do with the absolute values
of the counts. Reversal of the association (as measured by the odds ratio) cannot
occur if $q_0 = q_1$ (i.e., if there is no confounding), but the odds ratio is still
non-collapsible, as the next example illustrates.

We now consider the situation in which $q_0 = q_1$. Such a balanced situation would
occur, by construction, in a randomized clinical trial in which (say) equal numbers
in the z = 0 and z = 1 groups receive the treatment. We illustrate in Table 7.6 in which
there are 100 patients in each of the four combinations of x and z. In each
z stratum, we see an odds ratio for the treatment as compared to the control of 2.1.
We do not see a reversal in the direction of the association but rather an attenuation
toward the null, with the marginal odds ratio being 1.2.

We  emphasize  that  the  marginal  estimator  is  not  a  biased  estimate,  but  is  rather 
estimating  a  different  quantity,  the  averaged  or  marginal  association.  A  second  point 
to  emphasize  is  that,  as  we  have  just  illustrated,  collapsibility  and  confounding  are 
different  issues  and  should  not  be  confused.  In  particular,  it  is  possible  to  have 
confounding  present  without  non-collapsibility,  as  discussed  in  Greenland  et  al. 
(1999). 

Another issue that we briefly discuss is the effect of stratification on the variance
of an estimator. As discussed at the start of this section, if x and z are orthogonal but
z is associated with y, then including z in a linear model will increase the precision
of the estimator of the association between y and x. We illustrate numerically that
this is not the case in the logistic regression context, again referring to the data in
Table 7.6. Let $p_{xz}$ represent the probability of disease for treatment group x and
stratum z. In the conditional analysis we fit the model

$$\log\left(\frac{p_{xz}}{1 - p_{xz}}\right) = \begin{cases} \beta_0 & \text{for } x = 0,\ z = 0 \\ \beta_0 + \beta_x & \text{for } x = 1,\ z = 0 \\ \beta_0 + \beta_z & \text{for } x = 0,\ z = 1 \\ \beta_0 + \beta_x + \beta_z & \text{for } x = 1,\ z = 1, \end{cases}$$

where we have not included an interaction between x and z. This results in
$\exp(\hat{\beta}_x) = \exp(0.75) = 2.1$, as expected from Table 7.6, with standard error 0.40.
Now suppose we ignore the stratum information and let $p_x^*$ be the probability of
disease for treatment group x. We fit the model

$$\log\left(\frac{p_x^*}{1 - p_x^*}\right) = \begin{cases} \beta_0^* & \text{for } x = 0 \\ \beta_0^* + \beta_x^* & \text{for } x = 1. \end{cases}$$

This gives $\exp(\hat{\beta}_x^*) = \exp(0.20) = 1.2$, again as expected from Table 7.6, but with
standard error 0.20, which is a reduction from the conditional model and is in stark
contrast to the behavior we saw with the linear model.
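
The estimates and standard errors quoted for Table 7.6 can be checked by hand. The sketch below uses Woolf's large-sample variance for each log odds ratio and pools the two strata by inverse-variance weighting; this pooling is a stand-in for fitting the no-interaction logistic model, which is reasonable here only because the two stratum-specific odds ratios are identical, so that model reproduces the observed counts.

```python
import math

# Counts (Y = 0, Y = 1) from Table 7.6, indexed by (x, z).
counts = {(0, 0): (95, 5), (1, 0): (90, 10), (0, 1): (10, 90), (1, 1): (5, 95)}


def log_or_and_var(control, treated):
    """Log odds ratio (treated versus control) and Woolf's variance estimate."""
    (c0, c1), (t0, t1) = control, treated
    log_or = math.log((t1 * c0) / (t0 * c1))
    var = 1 / c0 + 1 / c1 + 1 / t0 + 1 / t1
    return log_or, var


# Conditional (stratified) analysis: inverse-variance pooling over z.
strata = [log_or_and_var(counts[(0, z)], counts[(1, z)]) for z in (0, 1)]
weights = [1 / v for _, v in strata]
cond_log_or = sum(w * b for w, (b, _) in zip(weights, strata)) / sum(weights)
cond_se = math.sqrt(1 / sum(weights))

# Marginal analysis: collapse over z, then compute a single log odds ratio.
marg_control = tuple(counts[(0, 0)][k] + counts[(0, 1)][k] for k in (0, 1))
marg_treated = tuple(counts[(1, 0)][k] + counts[(1, 1)][k] for k in (0, 1))
marg_log_or, marg_var = log_or_and_var(marg_control, marg_treated)
marg_se = math.sqrt(marg_var)

print(f"conditional: OR = {math.exp(cond_log_or):.1f}, se(log OR) = {cond_se:.2f}")
print(f"marginal:    OR = {math.exp(marg_log_or):.1f}, se(log OR) = {marg_se:.2f}")
```

The stratified analysis estimates a larger effect with a larger standard error, because each stratum contains cells with only 5 or 10 observations, whereas the collapsed table has roughly balanced cells.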

In  any  cross-classified  table  the  summary  we  observe  is  an  “averaged”  measure, 
where  the  average  is  with  respect  to  the  population  underlying  that  table.  Consider 
the right-hand 2 × 2 set of counts in Table 7.6, in which we had equal numbers in
each stratum (which mimics a randomized trial). The odds ratio comparing treatment
to  control  is  1.2  here  and  is  the  effect  averaged  across  strata  (and  any  other 
variables  that  were  unobserved).  Such  measures  are  relevant  to  what  are  sometimes 
referred  to  as  population  contrasts.  Depending  on  the  context,  we  will  often  wish 
to  include  additional  covariates  in  order  to  obtain  effect  measures  most  relevant  to 
particular  subgroups  (or  subpopulations).  The  issues  here  have  much  in  common 
with  marginal  and  conditional  modeling  as  discussed  in  the  context  of  dependent 
data  in  Chaps.  8  and  9. 

We  emphasize  that,  as  mentioned  above,  the  difference  between  population  and 
subpopulation-specific  estimates  should  not  be  referred  to  as  “bias”  since  different 
quantities  are  being  estimated.  As  a  final  note,  the  discussion  in  this  section  has 
centered  on  logistic  regression  models,  but  the  same  issues  hold  for  other  nonlinear 
summary  measures. 


7.10  Case-Control  Studies 


In  this  section  we  discuss  a  very  popular  design  in  epidemiology,  the  case-control 
study.  In  the  econometrics  literature,  this  design  is  known  as  choice-based  sampling. 


7.10.1  The  Epidemiological  Context 

Cohort  (prospective)  studies  investigate  the  causes  of  disease  by  proceeding  in  the 
natural  way  from  cause  to  effect.  Specifically,  individuals  in  different  exposure 
groups  of  interest  are  enrolled,  and  then  one  observes  whether  they  develop  the 
disease  or  not  over  some  time  period.  In  contrast,  case-control  (retrospective)  studies 
proceed  from  effect  to  cause.  Cases  and  disease-free  controls  are  identified,  and  then 
the  exposure  status  of  these  individuals  is  determined.  Table  7.7  demonstrates  the 
simplest example in which there is a single binary exposure, with $y_{ij}$ representing
the number of individuals in exposure group $i$, $i = 0, 1$, and disease group $j$,
$j = 0, 1$. In a cohort study, $n_0$ and $n_1$, the numbers of unexposed and exposed
individuals, are fixed by design, and the random variables are the number of
unexposed cases $y_{01}$ and the number of exposed cases $y_{11}$.

Table 7.7 Generic 2 × 2 table for a binary exposure and binary disease outcome

                     Not diseased   Diseased
                     Y = 0          Y = 1        Total
Unexposed  X = 0     y_00           y_01         n_0
Exposed    X = 1     y_10           y_11         n_1
Total                m_0            m_1          n

There  are  a  number  of  strong  motivations  for  carrying  out  a  case-control  study. 
Since many diseases are rare, a cohort study generally has to contain a large number
of  participants  to  demonstrate  an  association  between  a  risk  factor  and  disease 
because  few  individuals  will  develop  the  disease  (unless  the  effect  of  the  exposure 
of  interest  is  very  strong).  It  may  be  difficult  to  assemble  a  full  picture  of  the  disease 
across  subgroups  (as  defined  by  covariates)  within  a  cohort  study  because  the  cohort 
is  assembled  at  a  particular  time,  the  start  of  the  study.  As  the  study  proceeds,  certain 
subgroups,  for  example,  the  young,  disappear.  In  this  case  it  will  not  be  possible  to 
investigate  a  calendar  time/age  interaction,  that  is,  the  effect  of  calendar  time  at 
different  age  groups.  Finally,  the  disease  may  take  a  long  time  to  develop  (this  is 
true,  for  example,  for  most  cancers),  and  so  the  study  may  need  to  run  for  a  long 
period. 

The  case-control  study  provides  a  way  of  overcoming  these  difficulties.  With 
reference to Table 7.7, $m_0$ and $m_1$, the numbers of controls and cases, are fixed by
design, and the random variables are the number of exposed controls $y_{10}$ and the
number of exposed cases, $y_{11}$.

A  case-control  study  is  not  without  its  drawbacks.  Probabilities  of  disease  given 
exposure  status  are  no  longer  directly  estimable  without  external  information,  as  we 
will  discuss  in  more  detail  shortly.  Most  importantly,  the  study  participants  must 
be  selected  very  carefully.  The  probability  of  selection  for  the  study,  for  both  cases 
and  controls,  must  not  depend  on  exposure  status;  otherwise,  selection  bias  will  be 
introduced;  this  bias  can  arise  in  many  subtle  ways.  The  great  benefit  of  case-control 
studies  is  that  we  can  still  estimate  the  strength  of  the  relationship  between  exposure 
and disease, a topic we discuss in depth in the next section.
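
To see why the strength of association survives the change of design, consider the following minimal sketch. The counts and sampling fractions are hypothetical (they do not come from any table in this chapter), and expected counts are used in place of random draws. Sampling that depends on disease status only rescales the two columns of Table 7.7, which leaves the odds ratio unchanged but distorts the naive risk ratio.

```python
def odds_ratio(y00, y01, y10, y11):
    """Odds ratio from a 2 x 2 table, y_ij = count in exposure group i, disease group j."""
    return (y11 * y00) / (y10 * y01)


# Hypothetical cohort of 100,000 unexposed and 100,000 exposed individuals.
y01, y11 = 100.0, 300.0                    # unexposed and exposed cases
y00, y10 = 100_000 - y01, 100_000 - y11    # unexposed and exposed non-cases

# Hypothetical case-control sampling: all cases and 1% of non-cases,
# with selection depending on disease status only.
frac_cases, frac_controls = 1.0, 0.01
c00, c10 = y00 * frac_controls, y10 * frac_controls
c01, c11 = y01 * frac_cases, y11 * frac_cases

or_cohort = odds_ratio(y00, y01, y10, y11)
or_case_control = odds_ratio(c00, c01, c10, c11)

# The naive "risk ratio" in the case-control sample is not the cohort risk ratio.
rr_cohort = (y11 / (y10 + y11)) / (y01 / (y00 + y01))
rr_naive = (c11 / (c10 + c11)) / (c01 / (c00 + c01))

print(f"odds ratio: cohort {or_cohort:.3f}, case-control {or_case_control:.3f}")
print(f"risk ratio: cohort {rr_cohort:.3f}, naive case-control {rr_naive:.3f}")
```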


7.10.2  Estimation  for  a  Case-Control  Study 

Consider  the  situation  in  which  we  have  a  binary  response  Y  taking  the  values  0/1 
corresponding to disease-free/diseased and exposures contained in a $(k + 1) \times 1$
vector $x$. The exposures can be a mix of continuous and discrete variables. In the
case-control scenario, we select individuals on the basis of their disease status $y$,
and the random variables are the exposures $X$.

In a cohort study with a binary endpoint, a logistic regression disease model is
the most common choice for analysis, with form

$$\Pr(Y = 1 \mid x) = p(x) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{k} x_j \beta_j\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{k} x_j \beta_j\right)}. \qquad (7.31)$$


The  relative  risk  of  individuals  having  exposures  x  and  x*  is  defined  as 


$$\text{Relative risk} = \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 1 \mid x^*)}$$

and is an easily interpretable quantity that epidemiologists are familiar with.
As already mentioned in Sect. 7.6.1, for rare diseases, the relative risk is well
approximated by the odds ratio

$$\frac{\Pr(Y = 1 \mid x) / \Pr(Y = 0 \mid x)}{\Pr(Y = 1 \mid x^*) / \Pr(Y = 0 \mid x^*)}.$$

With respect to the logistic regression model (7.31),

$$\frac{p(x) / [1 - p(x)]}{p(x^*) / [1 - p(x^*)]} = \exp\left[\sum_{j=1}^{k} \beta_j (x_j - x_j^*)\right],$$

so that, in particular, $\exp(\beta_j)$ represents the multiplicative change in the odds of disease
associated with a unit increase in $x_j$, with all other covariates held fixed (Sect. 7.6.1).
The parameter $\beta_0$ represents the baseline log odds of disease, corresponding to the
odds when all of the exposures are set equal to zero.
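
The rare-disease approximation can be checked with a few lines of code. This is a sketch only: the single binary exposure, the log odds ratio $\log 2$, and the three baseline log odds are assumed values chosen for illustration. The odds ratio stays at 2 throughout, while the relative risk is close to 2 only when the baseline risk is small.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def rr_and_or(beta0, beta1):
    """Relative risk and odds ratio for x = 1 versus x* = 0 under a logistic model."""
    p1, p0 = expit(beta0 + beta1), expit(beta0)
    rr = p1 / p0
    odds_ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
    return rr, odds_ratio


beta1 = math.log(2.0)  # hypothetical log odds ratio
for beta0 in (-7.0, -4.0, -1.0):  # baseline log odds, from rare to common disease
    rr, odds_ratio = rr_and_or(beta0, beta1)
    print(f"beta0 = {beta0:5.1f}   RR = {rr:.3f}   OR = {odds_ratio:.3f}")
```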

We  now  turn  to  interpretation  in  a  case-control  study.  We  first  introduce  an 
indicator  variable  Z  which  represents  the  event  that  an  individual  was  selected  for 
the study ($Z = 1$) or not ($Z = 0$). Let $\pi_y = \Pr(Z = 1 \mid Y = y)$ denote the
probabilities of selection, given response $y$, $y = 0, 1$. Typically, $\pi_1$ is much greater
than $\pi_0$, since cases are rarer than non-cases. Now consider the probability that a
person is diseased, given exposures $x$ and selection for the study:

$$\Pr(Y = 1 \mid Z = 1, x) = \frac{\Pr(Z = 1 \mid Y = 1, x) \Pr(Y = 1 \mid x)}{\Pr(Z = 1 \mid x)}. \qquad (7.32)$$


The denominator may be simplified to

$$\Pr(Z = 1 \mid x) = \sum_{y=0}^{1} \Pr(Z = 1 \mid Y = y, x) \Pr(Y = y \mid x) = \sum_{y=0}^{1} \Pr(Z = 1 \mid Y = y) \Pr(Y = y \mid x),$$

where we have made the crucial assumption that

$$\Pr(Z = 1 \mid Y = y, x) = \Pr(Z = 1 \mid Y = y) = \pi_y,$$

for $y = 0, 1$, that is, the selection probabilities depend only on the disease status and
not on the exposures (i.e., there is no selection bias). If we take a random sample of
cases and controls, this assumption is valid. Substitution in (7.32), and assuming a
logistic regression model, gives

$$\begin{aligned}
\Pr(Y = 1 \mid Z = 1, x)
&= \frac{\pi_1 \exp(x\beta) / [1 + \exp(x\beta)]}{\pi_1 \exp(x\beta) / [1 + \exp(x\beta)] + \pi_0 / [1 + \exp(x\beta)]} \\
&= \frac{\pi_1 \exp\left(\beta_0 + \sum_{j=1}^{k} x_j \beta_j\right)}{\pi_0 + \pi_1 \exp\left(\beta_0 + \sum_{j=1}^{k} x_j \beta_j\right)} \\
&= \frac{\exp\left(\beta_0^\star + \sum_{j=1}^{k} x_j \beta_j\right)}{1 + \exp\left(\beta_0^\star + \sum_{j=1}^{k} x_j \beta_j\right)},
\end{aligned}$$

where $\beta_0^\star = \beta_0 + \log(\pi_1 / \pi_0)$. Hence, we see that the probabilities of disease in a
case-control study also follow a logistic model but with an altered intercept. In the usual
case, $\pi_1 > \pi_0$ so that the intercept is increased to account for the over-sampling of
cases. Unless information on $\pi_0$ and $\pi_1$ is available, we cannot obtain estimates of
$\Pr(Y = 1 \mid x)$ (the incidence for different exposure groups).
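
The intercept shift can be verified numerically. In the sketch below, the parameter values, the selection probabilities, and the restriction to a single scalar exposure are assumptions made purely for illustration. The probability computed directly from Bayes theorem, as in (7.32), agrees with a logistic model whose intercept is $\beta_0 + \log(\pi_1/\pi_0)$.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def pr_disease_given_selected(x, beta0, beta1, pi0, pi1):
    """Pr(Y = 1 | Z = 1, x) from Bayes theorem, with selection depending on Y only."""
    p = expit(beta0 + beta1 * x)  # cohort model Pr(Y = 1 | x)
    return pi1 * p / (pi1 * p + pi0 * (1.0 - p))


beta0, beta1 = -5.0, 0.8   # hypothetical cohort parameters
pi0, pi1 = 0.002, 0.9      # hypothetical selection probabilities
beta0_star = beta0 + math.log(pi1 / pi0)

for x in (-1.0, 0.0, 2.5):
    direct = pr_disease_given_selected(x, beta0, beta1, pi0, pi1)
    shifted = expit(beta0_star + beta1 * x)
    print(f"x = {x:4.1f}   Bayes: {direct:.6f}   shifted logistic: {shifted:.6f}")
```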

This  derivation  shows  that  assuming  a  logistic  model  in  the  cohort  context 
implies  that  the  disease  frequency  within  the  case-control  sample  also  follows  a 
logistic model, but does not illuminate how inference may be carried out. Suppose
there are $m_0$ controls and $m_1$ cases. Since the exposures are random in a
case-control context, the likelihood is of the form

$$L(\theta) = \prod_{y=0}^{1} \prod_{j=1}^{m_y} p(x_{yj} \mid y, \theta),$$

where $x_{yj}$ is the set of covariates for individual $j$ in disease group $y$, and it appears
that we are faced with the unenviable task of specifying forms, depending on
parameters $\theta$, for the distribution of covariates in the control and case populations.
In a seminal paper, Prentice and Pyke (1979) showed that asymptotic likelihood
inference for the odds ratio parameters was identical irrespective of whether the
data are collected prospectively or retrospectively. The proof of this result hinges
on assuming a logistic disease model, depending on parameters $\beta$, with additional
nuisance parameters being estimated via nonparametric maximum likelihood. Great
care is required in this context because unless the sample space for $x$ is finite
(i.e., the covariates are all discrete with a fixed number of categories), the dimension
of the nuisance parameter increases with the sample size.


To  summarize,  when  data  are  collected  from  a  case-control  study,  a  likelihood- 
based  analysis  with  a  logistic  regression  model  may  proceed  with  asymptotic 
inference,  acting  as  if  the  data  were  collected  in  a  cohort  fashion,  except  that  the 
intercept  is  no  longer  interpretable  as  the  baseline  log  odds  of  disease. 
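
A small simulation illustrates this summary. It is a sketch under assumptions, not a recommended analysis: the cohort parameters, the selection probabilities, the single standard normal exposure, and the hand-written Newton-Raphson routine `fit_logistic` are all hypothetical choices made so that the example is self-contained. Fitting the prospective logistic model to the selected sample recovers the slope, while the fitted intercept is close to $\beta_0 + \log(\pi_1/\pi_0)$ rather than $\beta_0$.

```python
import math
import random


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson MLE for logit Pr(Y = 1 | x) = b0 + b1 * x (one covariate)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = expit(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rng = random.Random(1)
beta0, beta1 = -4.0, 1.0   # hypothetical cohort parameters
pi0, pi1 = 0.03, 1.0       # hypothetical selection probabilities

xs, ys = [], []
for _ in range(200_000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < expit(beta0 + beta1 * x) else 0
    if rng.random() < (pi1 if y == 1 else pi0):  # selection depends on y only
        xs.append(x)
        ys.append(y)

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"slope estimate {b1_hat:.3f} (cohort value {beta1})")
print(f"intercept estimate {b0_hat:.3f} "
      f"(shifted value {beta0 + math.log(pi1 / pi0):.3f}, cohort value {beta0})")
```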


7.10.3  Estimation  for  a  Matched  Case-Control  Study 

A  common  approach  in  epidemiological  studies  is  to  “match”  the  controls  to  the 
cases  on  the  basis  of  known  confounders.  By  choosing  controls  to  be  similar 
to  cases,  one  “controls”  for  the  confounding  variables.  This  provides  efficiency 
gains  since  the  controls  are  more  similar  to  the  cases  with  respect  to  confounders, 
which  increases  power.  It  also  removes  the  need  to  model  the  disease-confounder 
relationship. 

In a frequency-matched design, the cases are grouped into broad strata (e.g., 10-year
age bands), and controls are matched on the basis of these variables. In an
individually  matched  study,  controls  are  matched  exactly,  usually  upon  multiple 
variables,  for  example,  age,  gender,  time  of  diagnosis,  and  area  of  residence.  For 
both  forms  of  matching,  the  nonrandom  selection  of  controls  must  be  acknowledged 
in  the  analysis  by  including  a  parameter  for  each  matching  set  in  the  logistic  model. 

For matched data, let $j = 1, \dots, J$ index the matched sets, and $Y_{ij}$ and $x_{ij}$
denote the responses and covariate vector of additional variables (i.e., beyond the
matching variables) for individual $i$, with $i = 1, \dots, m_{1j}$ representing the cases and
$i = m_{1j} + 1, \dots, m_{1j} + m_{0j}$ the controls. Hence, for $j = 1, \dots, J$,

$$Y_{ij} = 1 \quad \text{for } i = 1, \dots, m_{1j},$$

$$Y_{ij} = 0 \quad \text{for } i = m_{1j} + 1, \dots, m_{1j} + m_{0j},$$

and there are $m_1 = \sum_{j=1}^{J} m_{1j}$ cases and $m_0 = \sum_{j=1}^{J} m_{0j}$ controls in total.
The disease model is

$$\log\left[\frac{p_j(x_{ij})}{1 - p_j(x_{ij})}\right] = \alpha_j + x_{ij}\beta, \qquad (7.33)$$

where

$$p_j(x_{ij}) = \Pr(Y_{ij} = 1 \mid x_{ij}, \text{stratum } j)$$

for $i = 1, \dots, m_{0j} + m_{1j}$, $j = 1, \dots, J$. In terms of inference, the key distinction
between the two matching situations is that in the frequency matching situation, the
number of matching strata $J$ is fixed. In this case, the result outlined in Sect. 7.10.2
can be extended so that the matched data can be analyzed as if they were gathered
prospectively, though the intercept parameters $\alpha_j$ are no longer interpretable as log
odds ratios describing the association between disease and the variables defining
stratum $j$. For the same reason, it is not possible to estimate interactions between
stratum variables and exposures of interest. Calculations in Breslow and Day (1980)
show that, in terms of efficiency gains, it is usually not worth exceeding 5 controls
per case and 3 will often be sufficient. Exercise 7.8 considers the analysis of a
particular set of data to illustrate the benefits of case-control sampling and matching.

For individually matched data, for simplicity, suppose there are $M$ controls for
each case so that $m_{1j} = 1$ and $m_{0j} = M$ for all $j$. Hence, $m_1 = J$ and
$m_0 = MJ = M m_1$. Also let $n = m_1$ represent the number of cases so that $m_0 = Mn$ is
the number of controls. The likelihood contribution of the $j$th stratum is

$$p(x_{1j} \mid Y_{1j} = 1) \prod_{i=2}^{M+1} p(x_{ij} \mid Y_{ij} = 0), \qquad (7.34)$$

but care is required for inference because the number of nuisance parameters,
$\alpha_1, \dots, \alpha_n$, is equal to the number of cases/matching sets, $n$, and so increases with
sample size.

To  overcome  this  violation  of  the  usual  regularity  conditions,  a  conditional 
likelihood  may  be  constructed.  Specifically,  for  each  j,  one  conditions  on  the 
collection  of  M  +  1  covariate  vectors  within  each  matching  set.  The  conditional 
contribution is the probability that subject $i = 1$ is the case, given it could have been
any of the $M + 1$ subjects within that matching set. The numerator is (7.34), and the
denominator is this expression but evaluated under the possibility that each of the
$i = 1, \dots, M + 1$ individuals could have been the case. Hence, the $j$th contribution
to the conditional likelihood is

$$\frac{p(x_{1j} \mid Y_{1j} = 1) \prod_{i=2}^{M+1} p(x_{ij} \mid Y_{ij} = 0)}{\sum_{R_j} p(x_{\pi(1),j} \mid Y_{1j} = 1) \prod_{i=2}^{M+1} p(x_{\pi(i),j} \mid Y_{ij} = 0)},$$

where $R_j$ is the set of $M + 1$ permutations, $[x_{\pi(1),j}, \dots, x_{\pi(M+1),j}]$, of
$[x_{1j}, \dots, x_{M+1,j}]$. Applying Bayes theorem to each term,

$$p(x_{ij} \mid Y = y) = \frac{p(Y = y \mid x_{ij}) \, p(x_{ij})}{p(Y = y)},$$

and taking the product across matching sets, we obtain

$$L_c(\beta) = \prod_{j=1}^{n} \frac{p(Y_{1j} = 1 \mid x_{1j}) \prod_{i=2}^{M+1} p(Y_{ij} = 0 \mid x_{ij})}{\sum_{R_j} p(Y_{1j} = 1 \mid x_{\pi(1),j}) \prod_{i=2}^{M+1} p(Y_{ij} = 0 \mid x_{\pi(i),j})}.$$

Substitution of the logistic disease model (7.33) yields the conditional likelihood

$$L_c(\beta) = \prod_{j=1}^{n} \frac{\exp(x_{1j}\beta)}{\sum_{i=1}^{M+1} \exp(x_{ij}\beta)} = \prod_{j=1}^{n} \left(1 + \sum_{i=2}^{M+1} \exp[(x_{ij} - x_{1j})\beta]\right)^{-1},$$

with the $\alpha_j$ terms having canceled out, as was required. For further details, see Cox
and Snell (1989) and Prentice and Pyke (1979, Sect. 6). As an example, if $M = 2$
(two controls per case), the conditional likelihood is

$$L_c(\beta) = \prod_{j=1}^{n} \frac{\exp(x_{1j}\beta)}{\exp(x_{1j}\beta) + \exp(x_{2j}\beta) + \exp(x_{3j}\beta)} = \prod_{j=1}^{n} \left\{1 + \exp[(x_{2j} - x_{1j})\beta] + \exp[(x_{3j} - x_{1j})\beta]\right\}^{-1}.$$

Table 7.8 Notation for a matched-pair case-control study with n controls and n cases
and a single exposure

                     Not diseased   Diseased
                     Y = 0          Y = 1
Unexposed  X = 0     m_00           m_01
Exposed    X = 1     m_10           m_11
Total                n              n
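
The cancellation of the stratum intercepts can be checked directly. The sketch below assumes a scalar covariate and three hypothetical matched sets with $M = 2$ controls per case; the function names are illustrative only. The contribution built from the disease model (7.33), with an arbitrary intercept `alpha`, equals the conditional likelihood contribution whatever value `alpha` takes.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_contribution(xs, beta):
    """exp(x_1 beta) / sum_i exp(x_i beta); xs[0] is the covariate of the case."""
    weights = [math.exp(x * beta) for x in xs]
    return weights[0] / sum(weights)


def contribution_via_disease_model(xs, beta, alpha):
    """Same quantity computed from Pr(Y | x) with stratum intercept alpha."""
    def prob_only_case_is(k):
        # Probability that subject k is diseased and all others are disease-free.
        out = 1.0
        for i, x in enumerate(xs):
            p = expit(alpha + x * beta)
            out *= p if i == k else (1.0 - p)
        return out

    total = sum(prob_only_case_is(k) for k in range(len(xs)))
    return prob_only_case_is(0) / total


def conditional_log_likelihood(matched_sets, beta):
    """Log of the conditional likelihood, summed over matched sets."""
    return sum(math.log(conditional_contribution(xs, beta)) for xs in matched_sets)


# Hypothetical matched sets; the first entry of each list belongs to the case.
matched_sets = [[1.2, 0.3, -0.5], [0.1, 0.4, 0.0], [2.0, 1.1, 1.5]]
beta = 0.7
for alpha in (-3.0, 0.0, 2.0):
    values = [contribution_via_disease_model(xs, beta, alpha) for xs in matched_sets]
    print(f"alpha = {alpha:4.1f}: " + ", ".join(f"{v:.6f}" for v in values))
print("conditional: " + ", ".join(
    f"{conditional_contribution(xs, beta):.6f}" for xs in matched_sets))
```

In practice the log-likelihood returned by `conditional_log_likelihood` would be maximized over `beta` with a numerical optimizer.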


The  importance  of  the  use  of  conditional  likelihood  can  be  clearly  demonstrated 
in  the  matched-pairs  situation,  in  which  there  is  one  control  per  case.  Suppose 
that  the  data  are  as  summarized  in  Table  7.8  so  that  there  is  a  single  exposure 
only. There are $m_{00}$ concordant pairs in which neither case nor control is exposed
and $m_{11}$ concordant pairs in which both are exposed. Exercise 7.12 shows that
the unconditional MLE of the odds ratio is $(m_{10}/m_{01})^2$, the square of the ratio
of discordant pairs. In contrast, the estimate based on the appropriate conditional
likelihood is $m_{10}/m_{01}$. Hence, the unconditional estimator is the square of the
correct conditional estimator.
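
The two estimators can be compared numerically. The sketch below rests on several assumptions: the discordant-pair counts are hypothetical, we read $m_{10}$ as the number of pairs in which only the case is exposed and $m_{01}$ as the number in which only the control is exposed, and the unconditional likelihood is profiled over each pair's intercept (for a discordant pair the maximizing intercept is $-\beta/2$, and concordant pairs contribute only a constant). A simple golden-section search stands in for a general optimizer.

```python
import math


def conditional_loglik(beta, case_only, control_only):
    """Conditional log-likelihood for matched pairs with a binary exposure."""
    return case_only * beta - (case_only + control_only) * math.log1p(math.exp(beta))


def profile_unconditional_loglik(beta, case_only, control_only):
    """Unconditional log-likelihood with each pair's intercept maximized out."""
    log_p = -math.log1p(math.exp(-beta / 2.0))  # log expit(beta / 2)
    log_q = -math.log1p(math.exp(beta / 2.0))   # log expit(-beta / 2)
    return 2.0 * (case_only * log_p + control_only * log_q)


def maximize(f, lo=-10.0, hi=10.0, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            a = c
        else:
            b = d
    return (a + b) / 2.0


m10, m01 = 30, 15  # hypothetical discordant-pair counts
beta_cond = maximize(lambda b: conditional_loglik(b, m10, m01))
beta_uncond = maximize(lambda b: profile_unconditional_loglik(b, m10, m01))

print(f"conditional odds ratio estimate:   {math.exp(beta_cond):.4f}")
print(f"unconditional odds ratio estimate: {math.exp(beta_uncond):.4f}")
```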

A  further  caveat  to  the  use  of  individually  matched  case-control  data  is  that  it 
is  more  difficult  to  generalize  inference  to  a  specific  population  under  this  design 
because  the  manner  of  selection  is  far  from  that  of  a  random  sample. 


7.11  Concluding  Remarks 

The  analysis  of  binomial  data  is  difficult  unless  the  denominators  are  large  because 
there  is  so  little  information  in  a  single  Bernoulli  outcome.  In  addition,  the  models 
for  probabilities  are  typically  nonlinear.  Logistic  regression  models  are  the  obvious 
candidate  for  analysis,  but  the  interpretation  of  odds  ratios  is  not  straightforward, 
unless  the  outcome  of  interest  is  rare.  The  effect  of  omitting  variables  is  also 
nonobvious.  The  fact  that  the  linear  logistic  model  is  a  GLM  does  offer  advantages 
in  terms  of  consistency,  however,  and  the  logit  being  the  canonical  link  gives 
simplifications in terms of computation.

The use of conditional likelihood in individually matched case-control studies
in practice is uncontroversial, but its theoretical underpinning is not completely
convincing (since the conditioning statistic is not ancillary). Fisher’s exact test is
historically popular, but, as discussed in Sect. 4.2, frequentist hypothesis testing can
be difficult to implement in practice since p-values need to be interpreted in the
context of the sample size. For Fisher’s exact test, the discreteness of the test statistic can
also be problematic. Exercise 7.11 provides an alternative approach based on Bayes
factors. The latter do not suffer from the discreteness of the sampling distribution
(since one only uses the observed data and not other hypothetical realizations).


7.12  Bibliographic  Notes 

Robinson  and  Jewell  (1991)  examine  the  effects  of  omission  of  variables  in  logistic 
regression models and contrast the implications with the linear model case. Greenland
et al. (1999) is a wide-ranging discussion on collapsibility and confounding.
A  seminal  book  on  the  design  and  analysis  of  case-control  studies  is  Breslow  and 
Day  (1980).  There  is  no  Bayesian  analog  of  the  Prentice  and  Pyke  (1979)  result 
showing  the  equivalence  of  odds  ratio  estimation  for  prospective  and  retrospective 
sampling,  though  Seaman  and  Richardson  (2004)  show  the  equivalence  in  restricted 
circumstances.  Simplified  estimation  based  on  nonparametric  maximum  likelihood 
has  also  been  established  for  other  outcome-dependent  sampling  schemes  such  as 
two-phase  sampling;  see,  for  example,  White  (1982)  and  Breslow  and  Chatterjee 
(1999).  Again,  no  equivalent  Bayesian  approaches  are  available.  A  fully  Bayesian 
approach in a case-control setting would require the modeling of the covariate
distributions for each of the cases and controls, which is, in general, a difficult process
and  seems  unnecessary  given  that  there  is  no  direct  interest  in  these  distributions. 
Hence,  the  nonparametric  maximum  likelihood  procedure  seems  preferable,  though 
a  hybrid  approach  in  which  one  simply  combines  the  prospective  likelihood  with  a 
prior  would  seem  practically  reasonable  if  one  has  prior  information  and/or  one  is 
worried  about  asymptotic  inference. 

Rice  (2008)  shows  the  equivalence  between  conditional  likelihood  and  random 
effects  approaches  to  the  analysis  of  matched-pairs  case-control  data.  In  general, 
conditional  likelihood  does  not  have  a  Bayesian  interpretation,  though  Bayesian 
analyses  have  been  carried  out  in  the  individually  matched  case-control  situation 
by  combining  a  prior  with  the  conditional  likelihood.  This  approach  avoids  the 
difficulty  of  specifying  priors  over  nuisance  parameters  with  dimension  equal  to 
the  number  of  matching  sets  (Diggle  et  al.  2000). 

Fisher’s  exact  test  has  been  discussed  extensively  in  the  statistics  literature; 
see,  for  example,  Yates  (1984).  Altham  (1969)  published  an  intriguing  result 
showing  that  Fisher’s  exact  test  is  equivalent  to  a  Bayesian  analysis.  Specifically, 
let $p_{00}, p_{10}, p_{01}, p_{11}$ denote the underlying probabilities in a 2 × 2 table with entries
$y = [y_{00}, y_{10}, y_{01}, y_{11}]^{\mathrm{T}}$ (see Table 7.3), and suppose the prior on these probabilities
is (improper) Dirichlet with parameters $(0, 1, 1, 0)$. Then the posterior probability
$\Pr(p_{00} p_{11} / p_{10} p_{01} < 1 \mid y)$ equals the Fisher’s exact test p-value for testing
$H_0: p_{00} p_{11} = p_{10} p_{01}$ versus $H_1: p_{00} p_{11} < p_{10} p_{01}$. Hence, the prior (slightly)
favors a negative association between rows and columns, which is related to the fact
that conditioning on the margins (as is done in Fisher’s exact test) does lead to a
small loss of information.


7.13  Exercises 


7.1 Suppose $Z \mid p \sim \text{Bernoulli}(p)$.

(a) Show that the moment-generating function (Appendix D) of $Z$ is
$M_Z(t) = 1 - p + p\exp(t)$. Hence, show that the moment-generating function of
$Y = \sum_{i=1}^{n} Z_i$ is

$$M_Y(t) = [1 - p + p\exp(t)]^n,$$

which is the moment-generating function of a binomial random variable.

(b) Suppose $Y \mid \lambda \sim \text{Poisson}(\lambda)$. Show that the cumulant-generating
function (Appendix D) of $Y$ is

$$\lambda[\exp(t) - 1].$$

(c) From part (a), obtain the form of the cumulant-generating function of $Y$.
Suppose that $p \to 0$ and $n \to \infty$ in such a way that $\mu = np$ remains fixed.
By considering the limiting form of the cumulant-generating function of
$Y$, show that in this situation, the limiting distribution of $Y$ is Poisson
with mean $\mu$.


7.2 Before the advent of GLMs, the arc sine variance stabilizing transformation
was used for the analysis of binomial data. Suppose that $Y \mid p \sim \text{Binomial}(N, p)$
with $N$ large. Using a Taylor series expansion, show that the random variable

$$W = \arcsin\left(\sqrt{Y/N}\right)$$

has approximate first two moments:

$$E[W] \approx \arcsin(\sqrt{p}) - \frac{1 - 2p}{8N\sqrt{p(1 - p)}}, \qquad \operatorname{var}(W) \approx \frac{1}{4N}.$$


7.3 Suppose $Z_j \mid \lambda_j \sim_{ind} \text{Poisson}(\lambda_j)$, $j = 1, 2$, are independent Poisson random
variables with rates $\lambda_j$. Show that

$$Z_1 \mid Z_1 + Z_2, p \sim \text{Binomial}(Z_1 + Z_2, p),$$

with $p = \lambda_1 / (\lambda_1 + \lambda_2)$.

7.4 Consider $n$ Bernoulli trials with $Z_{ij}$, $j = 1, \dots, N_i$, the outcomes within trial
$i$, with $Y_i = \sum_{j=1}^{N_i} Z_{ij}$, $i = 1, \dots, n$. By writing

$$\operatorname{var}(Y_i) = \sum_{j=1}^{N_i} \operatorname{var}(Z_{ij}) + \sum_{j=1}^{N_i} \sum_{k \neq j} \operatorname{cov}(Z_{ij}, Z_{ik}),$$

show that

$$\operatorname{var}(Y_i) = N_i p_i (1 - p_i) \times [1 + (N_i - 1)\tau_i^2].$$

7.5 With respect to Sect. 7.6.2, show that for Bernoulli data the Pearson statistic
is $X^2 = n$. Find the deviance in this situation and comment on its usefulness
as a test of goodness of fit.

7.6 Show that the extended hypergeometric distribution (7.22) is a member of the
exponential family (Sect. 6.3), that is, the distribution can be written in the
form

$$\Pr(y_{11} \mid \theta, \alpha) = \exp\left\{\frac{y_{11}\theta - b(\theta)}{\alpha} + c(y_{11}, \alpha)\right\}$$

for suitable choices of $\alpha$, $b(\cdot)$, and $c(\cdot, \cdot)$.

7.7 In this question, a simulation study to investigate the impact on inference of
omitting covariates in logistic regression will be performed, in the situation
in which the covariates are independent of the exposure of interest. Let $x$ be
the covariate of interest and $z$ another covariate. Suppose the true (adjusted)
model is $Y_i \mid x_i, z_i \sim_{ind} \text{Bernoulli}(p_i)$, with

$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i + \beta_2 z_i. \qquad (7.35)$$

A comparison with the unadjusted model $Y_i \mid x_i \sim_{ind} \text{Bernoulli}(p_i^\star)$, where

$$\log\left(\frac{p_i^\star}{1 - p_i^\star}\right) = \beta_0^\star + \beta_1^\star x_i, \qquad (7.36)$$

for $i = 1, \dots, n$ with $n = 1{,}000$ will be made. Suppose $x$ is binary with $\Pr(X = 1) = 0.5$
and $Z \sim_{iid} N(0, 1)$ with $x$ and $z$ independent. Combinations of the
parameters $\beta_1 = 0.5, 1.0$ and $\beta_2 = 0.5, 1.0, 2.0, 3.0$, with $\beta_0 = -2$ in all
cases, will be considered.

For each combination of parameters, compare the results from the two
models, (7.35) and (7.36), with respect to:

(a) $E[\hat{\beta}_1]$ and $E[\hat{\beta}_1^\star]$, as compared to $\beta_1$

(b) The standard errors of $\hat{\beta}_1$ and $\hat{\beta}_1^\star$

(c) The coverage of 95% confidence intervals for $\beta_1$ and $\beta_1^\star$

(d) The probability of rejecting $H_0: \beta_1 = 0$ in model (7.35) and the
probability of rejecting $H_0: \beta_1^\star = 0$ in model (7.36). These probabilities
correspond to the powers of the tests. Calculate these probabilities using
Wald tests.

Table 7.9 Left table: leprosy cases and non-cases versus presence/absence
of BCG scar. Right table: leprosy cases and controls versus presence/absence
of BCG scar

BCG scar   Cases   Non-cases       BCG scar   Cases   Controls
Present    101     46,028          Present    101     554
Absent     159     34,594          Absent     159     446

Based  on  the  results,  summarize  the  effect  of  omitting  a  covariate  that  is 
independent  of  the  exposure  of  interest,  in  particular  in  comparison  with  the 
linear  model  case  (as  discussed  in  Sect.  5.9). 

7.8  This  question  illustrates  the  benefits  of  case-control  and  matched  case-control 
sampling,  taking  data  from  Fine  et  al.  (1986)  and  following  loosely  the 
presentation  of  Clayton  and  Hills  (1993).  Table  7.9  gives  data  from  a  cross- 
sectional  survey  carried  out  in  Northern  Malawi.  The  aim  of  this  study  was  to 
investigate whether receiving a bacillus Calmette-Guérin (BCG) vaccination
in early childhood (which protects against tuberculosis) gives any protection
against leprosy. Let $X = 0/1$ denote absence/presence of BCG scar, $Y = 0/1$
denote leprosy-free/leprosy, and $p_x = \Pr(Y = 1 \mid X = x)$, $x = 0, 1$:

(a) Fit the logistic model

$$\log\left(\frac{p_x}{1 - p_x}\right) = \beta_0 + \beta_1 x$$

to the case/non-case data in the left half of Table 7.9. Report your findings
in terms of an estimate of the odds ratio $\exp(\beta_1)$ along with an associated
standard error.

(b) Now consider the case/control data in the right half of Table 7.9 (these data
were simulated from the full dataset). Fit the logistic model

$$\log\left(\frac{p_x}{1 - p_x}\right) = \alpha + \beta_1 x$$

to the case/control data, and again report your findings in terms of the
odds ratio $\exp(\beta_1)$ along with an associated standard error. Hence, use this
example to describe the benefits, in terms of efficiency, of a case-control
study.

(c) In this example, the population data are known and consequently the
sampling fractions of cases and controls are also known. Hence, reconstruct
an estimate of $\beta_0$, using the results from the case-control analysis.

(d)  Next  the  benefits  of  matching  will  be  illustrated.  BCG  vaccination  was 
gradually  introduced  into  the  study  region,  and  so  older  people  are  less 
likely  to  have  been  vaccinated  but  also  more  likely  to  have  developed 
leprosy.  Therefore,  age  is  a  potential  confounder  in  this  study. 

Let $z = 0, 1, \dots, 6$ denote age represented as a factor and $p_{xz} =
\Pr(Y = 1 \mid X = x, Z = z)$, for $x = 0, 1$, denote the probability of leprosy
for an individual with BCG status $x$ and in age stratum $z$. To adjust for age,
fit the logistic model

$$\log\left(\frac{p_{xz}}{1 - p_{xz}}\right) = \beta_0 + \beta_1 x + \beta_z$$

to the data in the left half of Table 7.10. This model assumes a common
odds ratio across age strata. Report your findings in terms of the odds ratio
$\exp(\beta_1)$ and associated standard error.

Table 7.10 Left table: leprosy cases and non-cases as a function of presence/absence of BCG scar
and age. Right table: leprosy cases and matched controls as a function of presence/absence of BCG
scar and age

Left table (entries classified by BCG scar)

         Cases               Non-cases
Age      Absent   Present    Absent   Present
0-4      1        1          7,593    11,719
5-9      11       14         7,143    10,184
10-14    28       22         5,611    7,561
15-19    16       28         2,208    8,117
20-24    20       19         2,438    5,588
25-29    36       11         4,356    1,625
30-34    47       6          5,245    1,234

Right table (entries classified by BCG scar)

         Cases               Controls
Age      Absent   Present    Absent   Present
0-4      1        1          3        5
5-9      11       14         48       52
10-14    28       22         67       133
15-19    16       28         46       130
20-24    20       19         50       106
25-29    36       11         126      62
30-34    47       6          174      38

(e) If it were possible to sample controls from the non-cases in the left half of
Table 7.10, the age distribution would be highly skewed toward the young,
which would lead to an inefficient analysis. As an alternative, the right
half of Table 7.10 gives a simulated frequency-matched case-control study
with 4 controls per case within each age stratum. Analyze these data using
the logistic model

$$\log\left(\frac{p_{xz}}{1 - p_{xz}}\right) = \beta_0^\star + \beta_1 x + \beta_z^\star,$$

and report your findings in terms of $\exp(\beta_1)$ and its associated standard
error. Comment on the accuracy of inference as compared to the analysis
using the complete data.

7.9 Table 7.11 gives data from a toxicological experiment recording the number
of beetles that died after 5 h exposure to gaseous carbon disulphide at various
doses.

(a)  Fit  complementary  log-log,  probit,  and  logit  link  models  to  these  data 
using  likelihood  methods. 

(b)  Summarize  the  association  for  each  model  in  simple  terms. 

(c)  Examine  residuals  and  report  the  model  that  you  believe  provides  the  best 
fit to these data, along with your reasoning.

Table 7.11 Number of beetle deaths as a function of log dose, from Bliss (1935)

Log dose   No. beetles   No. killed
1.691      59            6
1.724      60            13
1.755      62            18
1.784      56            28
1.811      63            52
1.837      59            53
1.861      62            61
1.884      60            60

Table 7.12 Death penalty verdict by race of victim and defendant

                                     Death penalty
Defendant’s race   Victim’s race     Yes     No
White              White             19      132
White              Black             0       9
Black              White             11      52
Black              Black             6       97

(d)  Fit  your  favored  model  with  a  Bayesian  approach  using  (improper)  flat 
priors.  Is  there  a  substantive  difference  in  the  conclusions,  as  compared  to 
the  likelihood  analysis? 

7.10 Table 7.12 contains data from Radelet (1981) on death penalty verdict,
cross-classified by defendant’s race and victim’s race.

(a)  Fit  a  logistic  regression  model  that  includes  factors  for  both  defendant’s 
race  and  victim’s  race.  Estimate  the  odds  ratios  associated  with  receiving 
the  death  penalty  if  Black  as  compared  to  if  White,  for  the  situations  in 
which  the  victim  was  White  and  in  which  the  victim  was  Black. 

(b) Fit a logistic regression model to the marginal 2 × 2 table that collapses
across  victim’s  race,  and  hence,  estimate  the  odds  ratio  associated  with 
receiving  the  death  penalty  if  Black  versus  if  White. 

(c)  Discuss  the  results  of  the  two  parts,  in  relation  to  Simpson’s  paradox.  In 
particular,  discuss  the  paradox  in  terms  understandable  to  a  layperson. 

7.11 Suppose $Y_i \mid p_i \sim \text{Binomial}(N_i, p_i)$ for $i = 0, 1$ and that interest focuses on
$H_0: p_0 = p_1 = p$ versus $H_1: p_0 \neq p_1$.

(a) Consider the Bayes factor (Sect. 3.10)

$$\text{BF} = \frac{\Pr(y_0, y_1 \mid H_0)}{\Pr(y_0, y_1 \mid H_1)}$$

with the priors: $p \sim \text{Be}(a_0, b_0)$ under $H_0$ and $p_i \sim \text{Be}(a_i, b_i)$, for $i = 0, 1$,
under $H_1$. Obtain a closed-form expression for the Bayes factor.

(b) Calculate the Bayes factor for the tumor data given in Table 7.4 using
uniform priors, that is, $a_0 = a_1 = b_0 = b_1 = 1$.

(c) Based on the Bayes factor, would you reject $H_0$? Why?

(d) Using the same priors as in the previous part, evaluate the posterior
probability $\Pr(p_0 > p_1 \mid y_0, y_1)$. Based on this probability, what
would you conclude about equality of $p_0$ and $p_1$? Is your conclusion in
agreement with the previous part?

[Hint: Obtaining samples from the posteriors $p(p_i \mid y_i)$, for $i = 0, 1$, is a
simple way of obtaining the posterior of interest in the final part.]


7.12 This question derives unconditional and conditional estimators for the case of a matched-pairs case-control design with $n$ pairs and a binary exposure. The notation is given in Table 7.8, and the logistic model in the $j$th matching set is

$$\Pr(Y = 1 \mid x, j) = \frac{\exp(\alpha_j + x\beta)}{1 + \exp(\alpha_j + x\beta)},$$

for $x = 0, 1$ and $j = 1, \dots, n$.

(a) Show that the unconditional maximum likelihood estimator of $\exp(\beta)$ is the square of the ratio of the discordant pairs, $(m_{10}/m_{01})^2$.

(b) Show, by considering the distribution of $m_{10}$ given the total $m_{10} + m_{01}$, that the estimate of $\exp(\beta)$ based on the appropriate conditional likelihood is $m_{10}/m_{01}$.
Dependent Data


Chapter  8 

Linear  Models 


8.1  Introduction 

In  Part  III  of  the  book  the  conditional  independence  assumptions  of  Part  II  are 
relaxed  as  we  consider  models  for  dependent  data.  Such  data  occur  in  many 
contexts,  with  three  common  situations  being  when  sampling  is  over  time,  space, 
or  within  families.  We  do  not  discuss  pure  time  series  applications  in  which  data 
are  collected  over  a  single  (usually  long)  series;  this  is  a  vast  topic  with  many 
specialized  texts.  Generically,  we  consider  regression  modeling  situations  in  which 
there is a set of units (“clusters”) upon which multiple measurements have been
collected. For example, when data are available over time for a group of units, we
have longitudinal (also known as repeated measures) data, and each unit forms
a  cluster.  We  will  often  refer  to  the  units  as  individuals.  The  methods  described 
in  Part  II  for  calculating  uncertainty  measures  (such  as  standard  errors)  are  not 
applicable  in  situations  in  which  the  data  are  dependent. 

Throughout  Part  III  we  distinguish  approaches  that  specify  a  full  probability 
model  for  the  data  (with  likelihood  or  Bayesian  approaches  to  inference)  and 
those  that  specify  first,  and  possibly  second,  moments  only  (with  an  estimating 
function  being  constructed  for  inference).  As  in  Part  II  we  believe  it  will  often 
be advantageous to carry out inference from both standpoints in a complementary
fashion.  In  some  instances  the  form  of  the  question  of  interest  may  be  best  served 
by  a  particular  approach,  however,  and  this  will  be  stressed  at  relevant  points. 

In  this  chapter  we  consider  linear  regression  models.  Such  models  are  widely 
applicable  with  growth  curves,  such  as  the  dental  data  of  Sect.  1.3.5,  providing 
a  specific  example.  As  another  example,  in  the  so-called  split-plot  design,  fields 
are  planted  with  different  crops  and  within  each  field  (unit),  different  subunits  are 
treated  with  different  fertilizers.  We  expect  crop  yields  in  the  same  field  to  be  more 
similar  than  those  in  different  fields,  and  yields  may  be  modeled  as  a  linear  function 
of  crop  and  fertilizer  effects.  With  clustered  data,  we  expect  measurements  on  the 
same  unit  to  exhibit  residual  dependence  due  to  shared  unmeasured  variables,  where 
the  qualifier  acknowledges  that  we  have  controlled  for  known  regressors. 
The structure of this chapter is as follows. We begin, in Sect. 8.2, with a brief
overview of approaches to inference for dependent data, in the context of the
dental data of Sect. 1.3.5. Section 8.3 provides a description of the efficiency
gains that can be achieved with data collected over time in a longitudinal design.
In Sect. 8.4, linear mixed effects models, in which full probability models
are specified for the data, are introduced. In Sects. 8.5 and 8.6, likelihood and
Bayesian approaches to inference for these models are described. Section 8.7
discusses the generalized estimating equations (GEE) approach which is based on a
marginal mean specification and empirical sandwich estimation of standard errors.
We describe how the assumptions required for valid inference may be assessed in
Sect. 8.8 and discuss the estimation of longitudinal and cohort effects in Sect. 8.9.
Concluding remarks appear in Sect. 8.10 with bibliographic notes in Sect. 8.11.


8.2  Motivating  Example:  Dental  Growth  Curves 


In  Table  1.3  dental  measurements  of  the  distance  in  millimeters  from  the  center  of 
the pituitary gland to the pteryo-maxillary fissure are given for 11 girls and 16 boys,
recorded  at  the  ages  of  8,  10,  12,  and  14  years.  In  this  section  we  concentrate  on 
the  data  from  the  girls  only.  Figure  8.1(a)  plots  the  dental  measurements  for  each 
girl  versus  age.  The  slopes  look  quite  similar,  though  there  is  clearly  between-girl 
variability  in  the  intercepts. 

There  are  various  potential  aims  for  the  analysis  of  data  such  as  these: 

1. Population inference, in which we describe the average growth as a function of
age, for the population from which the sample of children was selected.

2. Assessment of the within- to between-child variability in growth measurements.

3. Individual-level inference, either for a child in the sample, or for a new
unobserved child (from the same population). The latter could be used to construct
a “growth chart” in which the percentile points of children’s measurements
at different ages are presented.

Fig. 8.1 Dental data for girls only: (a) individual observed data (with the girl index taken as
plotting symbol), (b) individual fitted curves (dashed) and overall fitted curve (solid)

Part  III  of  the  book  will  provide  extensive  discussion  of  mixed  effects  models  which 
contain  both  fixed  effects  that  are  shared  by  all  individuals  and  random  effects  that 
are  unique  to  particular  individuals  and  are  assumed  to  arise  from  a  distribution.  For 
longitudinal  data  there  are  two  extreme  fixed  effects  approaches.  Proceeding  naively, 
we could assume a single “marginal” curve for all of the girls’ data and carry out
a  standard  analysis  assuming  independent  data.  Marginal  here  refers  to  averaging 
over  girls  in  the  population.  At  the  other  extreme  we  could  assume  a  distinct  curve 
for  each  girl.  Figure  8.1(b)  displays  the  least  squares  fitted  lines  corresponding  to 
each  of  these  fixed  effects  approaches. 

Continuing with the marginal approach, let $Y_{ij}$ denote the $j$th measurement,
taken at time $t_j$ on the $i$th child, $i = 1, \dots, m = 11$, $j = 1, \dots, n_i = 4$. Consider
the model

$$E[Y_{ij}] = \beta_0^M + \beta_1^M t_j, \qquad (8.1)$$

where $\beta_0^M$ and $\beta_1^M$ represent marginal intercept and slope parameters. Then,

$$e_{ij}^M = Y_{ij} - \beta_0^M - \beta_1^M t_j,$$

$i = 1, \dots, 11$; $j = 1, \dots, 4$, denote marginal residuals. In Part II of the book, we
emphasized conditional independence, so that observations were independent given
a set of parameters; due to dependence of observations on the same girl, we would
not expect the marginal residuals to be independent.

We fit the marginal model (8.1) to the data from all girls and let

$$\begin{bmatrix} \sigma_1 & & & \\ \rho_{12} & \sigma_2 & & \\ \rho_{13} & \rho_{23} & \sigma_3 & \\ \rho_{14} & \rho_{24} & \rho_{34} & \sigma_4 \end{bmatrix} \qquad (8.2)$$

represent the standard deviation/correlation matrix of the residuals. Here,

$$\sigma_j = \sqrt{\text{var}(e_{ij}^M)}$$

is the standard deviation of the dental length at time $t_j$ and

$$\rho_{jk} = \frac{\text{cov}(e_{ij}^M, e_{ik}^M)}{\sqrt{\text{var}(e_{ij}^M)\,\text{var}(e_{ik}^M)}}$$

is the correlation between residual measurements taken at times $t_j$ and $t_k$ on the
same girl, $j \neq k$, $j, k = 1, \dots, 4$. We assume a distinct standard deviation
at each of the four ages, and distinct correlations between measurements at each of
the six combinations of pairs of ages, but assume that these standard deviations
and correlations are constant across all girls. We empirically estimate the entries
of (8.2) as

$$\begin{bmatrix} 2.12 & & & \\ 0.83 & 1.90 & & \\ 0.86 & 0.90 & 2.36 & \\ 0.84 & 0.88 & 0.95 & 2.44 \end{bmatrix} \qquad (8.3)$$



illustrating  that,  not  surprisingly,  there  is  clear  correlation  between  residuals  at 
different  ages  on  the  same  girl.  Fitting  a  single  curve  to  the  totality  of  the  data 
and  using  methods  for  independent  data  that  assume  within-girl  correlations  are 
zero will clearly give inappropriate standard errors/uncertainty estimates for $\beta_0^M$ and
$\beta_1^M$. Fitting such a marginal model is appealing, however, since it allows the direct
calculation  of  the  average  responses  at  different  ages.  Fitting  a  marginal  model 
forms  the  basis  of  the  GEE  approach  described  in  Sect.  8.7. 
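
The calculation behind a standard deviation/correlation summary such as (8.3) can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (the dental measurements of Table 1.3 are not reproduced here), the function name `residual_sd_corr` is hypothetical, and the parameter values used in the simulation are arbitrary. The sketch pools all units, fits a single least squares line, and then summarizes the residuals occasion by occasion.

```python
import numpy as np


def residual_sd_corr(y, t):
    """Pool all units, fit one OLS line, and summarize residuals by occasion.

    y is an (m, n) array (m units, n common measurement times t).
    Returns the OLS coefficients, the per-occasion residual standard
    deviations, and the n x n residual correlation matrix.
    """
    m, n = y.shape
    X = np.column_stack([np.ones(m * n), np.tile(t, m)])
    beta, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    resid = (y.ravel() - X @ beta).reshape(m, n)
    sd = resid.std(axis=0, ddof=1)
    corr = np.corrcoef(resid, rowvar=False)
    return beta, sd, corr


# Simulated stand-in for growth data: all numerical values are hypothetical.
rng = np.random.default_rng(0)
t = np.array([8.0, 10.0, 12.0, 14.0])
m = 200
b = rng.normal(0.0, 2.0, size=(m, 1))         # unit-specific intercept shifts
eps = rng.normal(0.0, 0.7, size=(m, t.size))  # measurement error
y = 17.0 + 0.5 * t + b + eps

beta, sd, corr = residual_sd_corr(y, t)
print(np.round(sd, 2))
print(np.round(corr, 2))
```

Because each simulated unit carries its own intercept shift, the off-diagonal residual correlations come out strongly positive, which is the pattern that makes independence-based standard errors inappropriate.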

The  alternative  fixed  effects  approach  is  to  assume  a  fixed  curve  for  each  child 
and  analyze  each  set  of  measurements  separately.  However,  while  providing  valid 
inference  for  each  curve,  there  is  no  “borrowing  of  strength”  across  children,  so  that 
each girl’s fit is based solely on her own data and not on the data of other children.
We  might  expect  that  there  is  similarity  between  the  curves,  and  therefore,  it  is 
reasonable  to  believe  that  the  totality  of  data  will  enhance  estimation  for  each  child. 
In  some  instances,  using  the  totality  of  data  will  be  vital.  For  example,  estimating  the 
growth  curve  for  a  girl  with  just  a  single  observation  is  clearly  not  possible  using 
the  observed  data  on  that  girl  only.  Suppose  we  are  interested  in  making  formal 
inference  for  the  population  of  girls  from  which  the  m  =  11  girls  are  viewed  as 
a  random  sample;  this  is  not  formally  possible  using  the  collection  of  fixed  effects 
estimates  from  each  girl.  The  basis  of  the  mixed  effects  model  approach  described  in 
Sect.  8.4  is  to  assume  a  girl-specific  set  of  random  effect  parameters  that  are  assumed 
to  arise  from  a  population.  In  different  contexts,  random  effects  may  have  a  direct 
interpretation  as  arising  from  a  population  of  effects,  or  may  simply  be  viewed  as  a 
convenient  modeling  tool,  in  situations  in  which  there  is  no  hypothetical  population 
of  effects  to  appeal  to. 

Throughout  Part  III,  we  will  describe  mixed  effects  and  GEE  approaches  to 
analysis.  The  mixed  effects  approach  can  be  seen  as  having  a  greater  contextual 
basis,  since  it  builds  up  a  model  from  the  level  of  the  unit.  In  contrast,  with  a 
marginal  model,  as  specified  in  GEE,  the  emphasis  is  on  population  inference  based 
on  minimal  assumptions  and  on  obtaining  a  reliable  standard  error  via  sandwich 
estimation. 


8.3  The  Efficiency  of  Longitudinal  Designs 

While  making  inference  for  dependent  data  is  in  general  more  difficult  than  for 
independent  data,  designs  that  collect  dependent  data  can  be  very  efficient.  For 
example, in a longitudinal data setting, applying different treatments to the same
patient over time can be very beneficial, since each patient acts as his/her own
control.  To  illustrate,  we  provide  a  comparison  between  longitudinal  and  cross- 
sectional  studies  (in  which  data  are  collected  at  a  single  time  point);  this  section 
follows  the  development  of  Diggle  et  al.  (2002,  Sect.  2.3). 

We consider a very simple situation in which we wish to compare two treatments,
coded as $-1$ and $+1$, and take four measurements in total. In the cross-sectional
study a single measurement is taken on each of four individuals with

$$Y_{i1} = \beta_0 + \beta_1 x_{i1} + \epsilon_{i1}, \qquad (8.4)$$

for $i = 1, \dots, m = 4$. The error terms $\epsilon_{i1}$ are independent with $E[\epsilon_{i1}] = 0$ and
$\text{var}(\epsilon_{i1}) = \sigma^2$. The design is such that $x_{11} = -1$, $x_{21} = -1$, $x_{31} = 1$, $x_{41} = 1$, so
that individuals 1 and 2 (3 and 4) receive treatment $-1$ ($+1$). With this coding, the
treatment effect is

$$E[Y_{i1} \mid x = 1] - E[Y_{i1} \mid x = -1] = 2\beta_1.$$

The (unbiased) ordinary least squares (OLS) estimators are

$$\hat\beta_0^C = \frac{\sum_{i=1}^{4} Y_{i1}}{4}, \qquad \hat\beta_1^C = \frac{Y_{31} + Y_{41} - (Y_{11} + Y_{21})}{4},$$

and, more importantly for our purposes, the variance of the treatment estimator is

$$\text{var}(\hat\beta_1^C) = \frac{\sigma^2}{4}.$$

The superscript here labels the relevant quantities as arising from the cross-sectional
design.

For the longitudinal study we assume the model

$$Y_{ij} = \beta_0 + \beta_1 x_{ij} + b_i + \delta_{ij},$$

with $b_i$ and $\delta_{ij}$ independent and $E[\delta_{ij}] = 0$, $\text{var}(\delta_{ij}) = \sigma_\delta^2$, $E[b_i] = 0$, $\text{var}(b_i) = \sigma_0^2$,
for $i = 1, 2$, $j = 1, 2$, so that we record two observations on each of two individuals.
The $b_i$ represent random individual-specific parameters and $\delta_{ij}$ measurement error.
Marginally, that is, averaging over individuals, and with $Y = [Y_{11}, Y_{12}, Y_{21}, Y_{22}]^T$,
we have $\text{var}(Y) = \sigma^2 R$ with

$$R = \begin{bmatrix} 1 & \rho & 0 & 0 \\ \rho & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho \\ 0 & 0 & \rho & 1 \end{bmatrix}, \qquad (8.5)$$

where $\sigma^2 = \sigma_0^2 + \sigma_\delta^2$ is the sum of the between- and within-individual variances and
$\rho = \sigma_0^2/\sigma^2$ is the correlation between observations on the same individual. Notice
that the cross-sectional variance model is a special case, with $\epsilon_{i1} = b_i + \delta_{i1}$ in (8.4).
We consider two designs. In the first, the treatment is constant over time for each
individual: $x_{11} = x_{12} = -1$, $x_{21} = x_{22} = 1$, while in the second each individual
receives both treatments: $x_{11} = x_{22} = 1$, $x_{12} = x_{21} = -1$. Generalized least
squares gives the unbiased estimator

$$\hat\beta^L = (x^T R^{-1} x)^{-1} x^T R^{-1} Y, \qquad (8.6)$$

with

$$\text{var}(\hat\beta^L) = (x^T R^{-1} x)^{-1} \sigma^2,$$

where $x$ is the $4 \times 2$ design matrix with rows $[1, x_{ij}]$ and $R$ is given by (8.5). The variance of the “slope” estimator is

$$\text{var}(\hat\beta_1^L) = \frac{\sigma^2 (1 - \rho^2)}{4 - 2\rho(x_{11}x_{12} + x_{21}x_{22})}.$$

The efficiency of the longitudinal design, as compared to the cross-sectional design,
is therefore

$$\frac{\text{var}(\hat\beta_1^L)}{\text{var}(\hat\beta_1^C)} = \frac{1 - \rho^2}{1 - \rho(x_{11}x_{12} + x_{21}x_{22})/2}.$$

The efficiency of the longitudinal study with treatment held constant within
individuals is

$$1 + \rho,$$

so  that  in  this  case,  the  cross-sectional  study  is  preferable  in  the  usual  situation 
in  which  observations  on  the  same  individual  display  positive  correlation,  that 
is,  p  >  0.  When  the  treatment  is  constant  within  individuals,  the  treatment  estimate 
is  based  on  between-individual  comparisons  only,  and  so,  it  is  more  beneficial  to 
obtain  measurements  on  additional  individuals. 

The efficiency of the longitudinal study with treatments changing within individuals
is

$$1 - \rho,$$

so  that  the  longitudinal  study  is  more  efficient  when  p  >  0,  because  each 
individual  is  acting  as  his/her  own  control.  That  is,  we  are  making  within- 
individual  comparisons.  If  p  =  0,  the  designs  have  the  same  efficiency.  In  practice, 
collecting  two  measurements  on  different  individuals  will  often  be  logistically  more 
straightforward than collecting two measurements on the same individual (e.g., with
the  possibility  of  missing  data  at  the  second  time  point),  but  in  pure  efficiency  terms, 
the  longitudinal  design  with  changing  treatment  can  be  very  efficient.  Clearly,  this 
discussion  extends  to  other  longitudinal  situations  in  which  covariates  are  changing 
over  time  (and  more  general  situations  with  covariate  variation  within  clusters). 
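
The two efficiencies can be checked numerically. The sketch below assumes the two-individual, two-occasion setup of this section with $\sigma^2 = 1$; the function name `slope_variance_gls` and the value $\rho = 0.6$ are illustrative choices, not part of the original development. It forms the generalized least squares variance from (8.5) and (8.6) and divides by the cross-sectional variance $\sigma^2/4$.

```python
import numpy as np


def slope_variance_gls(x, rho, sigma2=1.0):
    """Variance of the GLS slope estimator for Y = [Y11, Y12, Y21, Y22].

    x holds the treatment codes in the same order; R is block diagonal
    with within-individual correlation rho, as in (8.5).
    """
    block = np.array([[1.0, rho], [rho, 1.0]])
    R = np.kron(np.eye(2), block)
    X = np.column_stack([np.ones(4), x])
    cov = sigma2 * np.linalg.inv(X.T @ np.linalg.solve(R, X))
    return cov[1, 1]


rho = 0.6                 # illustrative within-individual correlation
var_cross = 1.0 / 4.0     # sigma^2 / 4 with sigma^2 = 1

x_constant = np.array([-1.0, -1.0, 1.0, 1.0])  # treatment fixed within individual
x_changing = np.array([1.0, -1.0, -1.0, 1.0])  # each individual gets both treatments

eff_constant = slope_variance_gls(x_constant, rho) / var_cross
eff_changing = slope_variance_gls(x_changing, rho) / var_cross
print(eff_constant, 1 + rho)
print(eff_changing, 1 - rho)
```

Ratios above one indicate that the longitudinal estimator has the larger variance, so the constant-treatment design loses to the cross-sectional study when $\rho > 0$, while the changing-treatment design gains.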
8.4  Linear Mixed Models

8.4.1  The General Framework

The  basic  idea  behind  mixed  effects  models  is  to  assume  that  each  unit  has 
a  regression  model  characterized  by  both  fixed  effects,  that  are  common  to  all 
units  in  the  population,  and  unit-specific  perturbations,  or  random  effects.  “Mixed” 
effects  refers  to  the  combination  of  both  fixed  and  random  effects.  The  frequentist 
interpretation  of  the  random  effects  is  that  the  units  can  be  viewed  as  a  random 
sample  from  a  hypothetical  super-population  of  units.  A  Bayesian  interpretation 
arises  through  considerations  of  exchangeability  (Sect.  3.9),  as  we  discuss  further  in 
Sect.  8.6.2. 

Let the multiple responses for the $i$th unit be $Y_i = [Y_{i1}, \dots, Y_{in_i}]^T$, $i = 1, \dots, m$.
We assume that responses on different units are independent but that there is
dependence between observations on the same unit. Let $\beta$ represent a $(k + 1) \times 1$
vector of fixed effects and $b_i$ a $(q + 1) \times 1$ vector of random effects, with $q \leq k$. In
this chapter, we assume the mean for $Y_{ij}$ is linear in the fixed and random effects.
Let $x_{ij} = [1, x_{ij1}, \dots, x_{ijk}]^T$ be a $(k + 1) \times 1$ vector of covariates measured at
occasion $j$, so that $x_i = [x_{i1}, \dots, x_{in_i}]^T$ is the design matrix for the fixed effects for
unit $i$. Similarly, let $z_{ij} = [1, z_{ij1}, \dots, z_{ijq}]^T$ be a $(q + 1) \times 1$ vector of variables that
are a subset of $x_{ij}$, so that $z_i = [z_{i1}, \dots, z_{in_i}]^T$ is the design matrix for the random
effects.

We  describe  a  two-stage  linear  mixed  model  (LMM). 

Stage One: The response model, conditional on random effects $b_i$, is

$$y_i = x_i\beta + z_i b_i + \epsilon_i, \qquad (8.7)$$

where $\epsilon_i$ is an $n_i \times 1$ zero-mean vector of error terms, $i = 1, \dots, m$.

Stage Two: The random terms in (8.7) satisfy

$$E[\epsilon_i] = 0, \quad \text{var}(\epsilon_i) = E_i(\alpha),$$
$$E[b_i] = 0, \quad \text{var}(b_i) = D(\alpha),$$
$$\text{cov}(b_i, \epsilon_{i'}) = 0, \quad i, i' = 1, \dots, m,$$

where $\alpha$ is an $r \times 1$ vector containing the collection of variance-covariance
parameters. Further, $\text{cov}(\epsilon_i, \epsilon_{i'}) = 0$ and $\text{cov}(b_i, b_{i'}) = 0$, for $i \neq i'$.

The two stages may be collapsed, by averaging over the random effects, to give the
marginal model:

$$E[Y_i] = x_i\beta,$$
$$\text{var}(Y_i) = V_i(\alpha) = z_i D(\alpha) z_i^T + E_i(\alpha), \qquad (8.8)$$

for $i = 1, \dots, m$, so that $V_i(\alpha)$ is an $n_i \times n_i$ matrix. The random effects have
therefore induced dependence between responses on the same individual through the first term in (8.8).
However, responses on individuals $i$ and $i'$, $i \neq i'$, are independent:

$$\text{cov}(Y_i, Y_{i'}) = 0,$$

where $\text{cov}(Y_i, Y_{i'})$ is the $n_i \times n_{i'}$ matrix with element $(j, j')$ corresponding to
$\text{cov}(Y_{ij}, Y_{i'j'})$, $j = 1, \dots, n_i$, $j' = 1, \dots, n_{i'}$.
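
As a rough check on (8.8), the following sketch compares the analytic marginal covariance $z_i D z_i^T + E_i$ with a Monte Carlo estimate. Everything numerical here is an assumption made for illustration: the times, the entries of `D`, the error variance, and the use of normal distributions (Stage Two itself only specifies first and second moments). The helper name `marginal_cov` is hypothetical.

```python
import numpy as np


def marginal_cov(z, D, E):
    """Marginal covariance of Y_i implied by the two-stage model: z D z^T + E."""
    return z @ D @ z.T + E


rng = np.random.default_rng(1)
times = np.array([0.0, 1.0, 2.0])
z = np.column_stack([np.ones(3), times])   # random intercept and slope
D = np.array([[1.0, 0.2], [0.2, 0.5]])     # illustrative var(b_i)
E = 0.3 * np.eye(3)                        # illustrative var(eps_i)
V = marginal_cov(z, D, E)

# Simulate Y_i - x_i beta = z_i b_i + eps_i many times (normality assumed).
n_sim = 100_000
b = rng.multivariate_normal(np.zeros(2), D, size=n_sim)
eps = rng.multivariate_normal(np.zeros(3), E, size=n_sim)
dev = b @ z.T + eps
V_mc = np.cov(dev, rowvar=False)

print(np.round(V, 2))
print(np.round(V_mc, 2))
```

The simulated covariance should agree with the analytic form up to Monte Carlo error, and the off-diagonal entries arise entirely from the shared random effects.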


8.4.2  Covariance  Models  for  Clustered  Data 


With respect to model (8.7), a common assumption is that $b_i \sim_{iid} N_{q+1}(0, D)$ and
$\epsilon_i \sim_{ind} N_{n_i}(0, E_i)$. A common variance for all individuals and at all measurement
occasions, along with uncorrelated errors, gives the simplified form $E_i = \sigma_\epsilon^2 I_{n_i}$.
We will refer to $\sigma_\epsilon^2$ as the measurement error variance, but as usual, the error
terms  may  include  contributions  from  model  misspecification,  such  as  departures 
from  linearity,  and  data  recording  errors.  The  inclusion  of  random  effects  induces 
a  marginal  covariance  model  for  the  data.  This  may  be  contrasted  with  the  direct 
specification  of  a  marginal  variance  model.  In  this  section  we  begin  by  deriving 
the  marginal  variance  structure  that  arises  from  two  simple  random  effects  models, 
before  describing  more  general  covariance  structures.  It  is  important  to  examine  the 
marginal  variances  and  covariances,  since  these  may  be  directly  assessed  from  the 
observed  data. 

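We first consider the random intercepts only model $z_i b_i = 1_{n_i} b_i$ with $\text{var}(b_i) = \sigma_0^2$,
along with $E_i = \sigma_\epsilon^2 I_{n_i}$. From (8.8), it is straightforward to show that this
stipulation gives the exchangeable or compound symmetry marginal variance model:

$$\text{var}(Y_i) = \sigma^2 \begin{bmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho & \cdots & \rho \\ \rho & \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho & \cdots & 1 \end{bmatrix},$$

where $\sigma^2 = \sigma_\epsilon^2 + \sigma_0^2$ and $\rho = \sigma_0^2/\sigma^2$. In this case we have two variance parameters
so that $\alpha = [\sigma_\epsilon^2, \sigma_0^2]$. A consequence of between-individual variability in intercepts
is therefore constant marginal within-individual correlation. The latter must be
nonnegative under this model (since $\sigma_0^2 \geq 0$) which would seem reasonable in most
situations.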
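
The step from the random intercepts model to the exchangeable structure can be written out explicitly. The derivation below is a short sketch that assumes, as in Stage Two, that $b_i$ is uncorrelated with the errors and that the errors are mutually uncorrelated with common variance $\sigma_\epsilon^2$.

```latex
\begin{align*}
Y_{ij} &= x_{ij}^T \beta + b_i + \epsilon_{ij}, \\
\mathrm{var}(Y_{ij}) &= \mathrm{var}(b_i) + \mathrm{var}(\epsilon_{ij})
  = \sigma_0^2 + \sigma_\epsilon^2 = \sigma^2, \\
\mathrm{cov}(Y_{ij}, Y_{ik}) &= \mathrm{cov}(b_i + \epsilon_{ij},\, b_i + \epsilon_{ik})
  = \mathrm{var}(b_i) = \sigma_0^2, \qquad j \neq k, \\
\mathrm{corr}(Y_{ij}, Y_{ik}) &= \frac{\sigma_0^2}{\sigma_0^2 + \sigma_\epsilon^2} = \rho .
\end{align*}
```

The correlation does not depend on $j$ or $k$, which is why every off-diagonal entry of the matrix is the same.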

The  exchangeable  model  is  particularly  appropriate  for  clustered  data  with  no 
time ordering as may arise, for example, in a split-plot design, or for multiple
measurements within a family. It may be useful for longitudinal data also, particularly
over short time scales. If we think of residual variability as being due to unmeasured
variables,  then  the  exchangeable  structure  is  most  appropriate  when  we  believe  such 
variables  are  relatively  constant  across  responses  within  an  individual. 

We  now  consider  a  model  with  both  random  intercepts  and  random  slopes.  Such 
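a model is a common choice in longitudinal studies. With respect to (8.7) and for
$i = 1, \dots, m$, the first stage model is

$$\begin{bmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{in_i} \end{bmatrix} = \begin{bmatrix} 1 & t_{i1} \\ 1 & t_{i2} \\ \vdots & \vdots \\ 1 & t_{in_i} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} 1 & t_{i1} \\ 1 & t_{i2} \\ \vdots & \vdots \\ 1 & t_{in_i} \end{bmatrix} \begin{bmatrix} b_{i0} \\ b_{i1} \end{bmatrix} + \begin{bmatrix} \epsilon_{i1} \\ \epsilon_{i2} \\ \vdots \\ \epsilon_{in_i} \end{bmatrix},$$

with $b_i = [b_{i0}, b_{i1}]^T$ and $\text{var}(b_i) = D$ where

$$D = \begin{bmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{bmatrix}.$$

Therefore, $\sigma_0$ is the standard deviation of the intercepts, $\sigma_1$ is the standard deviation
of the slopes, and $\sigma_{01}$ is the covariance between the two. This model induces a
marginal variance at time $t_{ij}$ which is quadratic in time:

$$\text{var}(Y_{ij}) = \sigma_\epsilon^2 + \sigma_0^2 + 2\sigma_{01} t_{ij} + \sigma_1^2 t_{ij}^2. \qquad (8.9)$$

The marginal correlation between observations at times $t_{ij}$ and $t_{ik}$ is

$$\rho_{jk} = \frac{\sigma_0^2 + (t_{ij} + t_{ik})\sigma_{01} + t_{ij} t_{ik} \sigma_1^2}{(\sigma_\epsilon^2 + \sigma_0^2 + 2 t_{ij}\sigma_{01} + t_{ij}^2 \sigma_1^2)^{1/2} (\sigma_\epsilon^2 + \sigma_0^2 + 2 t_{ik}\sigma_{01} + t_{ik}^2 \sigma_1^2)^{1/2}} \qquad (8.10)$$

for $j, k = 1, \dots, n_i$, $j \neq k$. Therefore, the assumption of random slopes has induced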
marginal  correlations  that  vary  as  a  function  of  the  timings  of  the  measurements. 
After  a  model  is  fitted,  the  variances  (8.9)  and  correlations  (8.10)  can  be  evaluated  at 
the  estimated  variance  components  and  compared  to  the  empirical  marginal  variance 
and  correlations. 
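
The formulas (8.9) and (8.10) can be checked against the matrix form (8.8). The sketch below uses hypothetical variance components and measurement times (none of these values come from a fitted model), and the function names `marginal_var` and `marginal_corr` are introduced only for this illustration.

```python
import numpy as np


def marginal_var(t, s_eps2, s0_2, s01, s1_2):
    """Marginal variance (8.9) at time t."""
    return s_eps2 + s0_2 + 2.0 * s01 * t + s1_2 * t**2


def marginal_corr(tj, tk, s_eps2, s0_2, s01, s1_2):
    """Marginal correlation (8.10) between observations at times tj != tk."""
    num = s0_2 + (tj + tk) * s01 + tj * tk * s1_2
    den = np.sqrt(marginal_var(tj, s_eps2, s0_2, s01, s1_2)
                  * marginal_var(tk, s_eps2, s0_2, s01, s1_2))
    return num / den


# Hypothetical variance components and measurement times.
s_eps2, s0_2, s01, s1_2 = 0.5, 4.0, -0.2, 0.03
times = np.array([8.0, 10.0, 12.0, 14.0])

# Matrix form: V = z D z^T + sigma_eps^2 I, as in (8.8).
z = np.column_stack([np.ones_like(times), times])
D = np.array([[s0_2, s01], [s01, s1_2]])
V = z @ D @ z.T + s_eps2 * np.eye(times.size)
sd = np.sqrt(np.diag(V))
corr_from_V = V / np.outer(sd, sd)

print(np.round(np.diag(V), 3))
print(np.round(corr_from_V, 3))
```

In practice the same functions would be evaluated at estimated variance components and the result set beside the empirical variances and correlations.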

In  a  longitudinal  setting,  an  obvious  extension  to  model  (8.7)  is  provided  by 

$$y_i = x_i\beta + z_i b_i + \delta_i + \epsilon_i, \qquad (8.11)$$

with the error vectors $b_i$, $\delta_i$, and $\epsilon_i$ representing individual-specific random effects,
serial dependence, and measurement error. We assume

$$E[\epsilon_i] = 0, \quad \text{var}(\epsilon_i) = \sigma_\epsilon^2 I_{n_i},$$
$$E[b_i] = 0, \quad \text{var}(b_i) = D,$$
$$E[\delta_i] = 0, \quad \text{var}(\delta_i) = \sigma_\delta^2 R_i,$$
$$\text{cov}(b_i, \epsilon_{i'}) = 0, \quad i, i' = 1, \dots, m,$$
$$\text{cov}(b_i, \delta_{i'}) = 0, \quad i, i' = 1, \dots, m,$$
$$\text{cov}(\delta_i, \epsilon_{i'}) = 0, \quad i, i' = 1, \dots, m,$$

with $\text{cov}(\epsilon_i, \epsilon_{i'}) = 0$, $\text{cov}(\delta_i, \delta_{i'}) = 0$, and $\text{cov}(b_i, b_{i'}) = 0$, for $i \neq i'$. Here,
$R_i$ is an $n_i \times n_i$ correlation matrix with elements $R_{ijk}$, for $j, k = 1, \dots, n_i$, which
correspond to within-individual correlations.

In  general,  it  is  difficult  to  identify/estimate  all  three  sources  of  variability,  but 
this  formulation  provides  a  useful  conceptual  model. 

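We now discuss specific choices of $R_i$, beginning with a widely-used time series
model, the first-order autoregressive, or AR(1), process. We assume initially that
responses are observed at equally spaced times. For $j \geq 2$ and $|\rho| < 1$ suppose

$$\delta_{ij} = \rho\,\delta_{i,j-1} + u_{ij}, \qquad (8.12)$$

with $u_i = [u_{i1}, \dots, u_{in_i}]^T$, $E[u_i] = 0$, $\text{var}(u_i) = \sigma_u^2 I_{n_i}$, and with $u_{ij}$ independent
of all other error terms in the model. We first derive the marginal moments
corresponding to this model. Repeated application of (8.12) gives, for $k > 0$,

$$\delta_{ij} = u_{ij} + \rho\,u_{i,j-1} + \rho^2 u_{i,j-2} + \cdots + \rho^{k-1} u_{i,j-k+1} + \rho^k \delta_{i,j-k}, \qquad (8.13)$$

so that

$$\text{var}(\delta_{ij}) = \sigma_u^2 (1 + \rho^2 + \rho^4 + \cdots + \rho^{2(k-1)}) + \rho^{2k}\,\text{var}(\delta_{i,j-k}).$$

Taking the limit as $k \to \infty$, and using $\sum_{l=1}^{\infty} x^{l-1} = (1 - x)^{-1}$ for $|x| < 1$, gives

$$\text{var}(\delta_{ij}) = \frac{\sigma_u^2}{1 - \rho^2} = \sigma_\delta^2,$$

which is the marginal variance of all of the $\delta$ error terms. Using (8.13),

$$\text{cov}(\delta_{ij}, \delta_{i,j-k}) = E[\delta_{ij}\delta_{i,j-k}] = \rho^k E[\delta_{i,j-k}^2] = \rho^k\,\text{var}(\delta_{i,j-k}) = \sigma_\delta^2 \rho^k,$$

so that the correlations decline as observations become further apart in time. Under
this model, the correlation matrix of $\delta_i$ is

$$R_i = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n_i - 1} \\ \rho & 1 & \rho & \cdots & \rho^{n_i - 2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n_i - 3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n_i - 1} & \rho^{n_i - 2} & \rho^{n_i - 3} & \cdots & 1 \end{bmatrix}.$$

The autoregressive model is appealing in longitudinal settings and contains just two
parameters, $\sigma_\delta^2$ and $\rho$. The model can be extended to unequally spaced times to give
covariance

$$\text{cov}(\delta_{ij}, \delta_{ik}) = \sigma_\delta^2 \rho^{|t_{ij} - t_{ik}|}. \qquad (8.14)$$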
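
A small sketch may help fix ideas about the AR(1) structure. It builds the correlation matrix with entries $\rho^{|t_{ij} - t_{ik}|}$ for equally and unequally spaced times, and iterates the variance recursion implied by (8.12) to confirm the limiting value $\sigma_u^2/(1 - \rho^2)$. The function names and all numerical values are assumptions made for this illustration, and the unequally spaced case is restricted here to $0 \leq \rho < 1$ so that fractional powers are well defined.

```python
import numpy as np


def ar1_corr(times, rho):
    """AR(1) correlation matrix with entries rho ** |t_j - t_k|.

    Assumes 0 <= rho < 1 so that fractional powers are well defined when
    the times are unequally spaced.
    """
    times = np.asarray(times, dtype=float)
    lags = np.abs(times[:, None] - times[None, :])
    return rho ** lags


def stationary_variance(sigma_u2, rho, n_steps=200):
    """Iterate var(delta_j) = rho^2 var(delta_{j-1}) + sigma_u^2 from zero."""
    v = 0.0
    for _ in range(n_steps):
        v = rho**2 * v + sigma_u2
    return v


rho, sigma_u2 = 0.5, 1.0   # illustrative values
R_equal = ar1_corr([1, 2, 3, 4], rho)
R_unequal = ar1_corr([0.0, 0.5, 2.0, 5.0], rho)
sigma_delta2 = stationary_variance(sigma_u2, rho)

print(R_equal)
print(np.round(R_unequal, 3))
print(sigma_delta2, sigma_u2 / (1 - rho**2))
```

With equally spaced times the matrix reproduces the pattern $1, \rho, \rho^2, \dots$ along each row, while unequal spacing simply replaces the integer lag by the elapsed time.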


A Toeplitz model assumes the variance is constant across time and that responses
that are an equal distance apart in time have the same correlation. (In linear algebra,
a Toeplitz matrix is a matrix in which each descending diagonal, from left to right, is
constant.) For equally spaced responses in time, the correlation matrix has the form

$$\begin{bmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{n_i - 1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{n_i - 2} \\ \rho_2 & \rho_1 & 1 & \cdots & \rho_{n_i - 3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{n_i - 1} & \rho_{n_i - 2} & \rho_{n_i - 3} & \cdots & 1 \end{bmatrix}.$$

This model may be useful in situations in which there is a common design across
individuals, which allows estimation of the $n_i = n$ parameters ($n - 1$ correlations
and a variance). The AR(1) model is a special case in which $\rho_k = \rho^k$.

An unstructured covariance structure allows for different variances at each
occasion $\sigma_1^2, \dots, \sigma_{n_i}^2$ and distinct correlations for each pair of responses, that is,

$$\begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1 n_i} \\ \rho_{21} & 1 & \rho_{23} & \cdots & \rho_{2 n_i} \\ \rho_{31} & \rho_{32} & 1 & \cdots & \rho_{3 n_i} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{n_i 1} & \rho_{n_i 2} & \rho_{n_i 3} & \cdots & 1 \end{bmatrix},$$

with $\rho_{jk} = \rho_{kj}$, for $j, k = 1, \dots, n_i$. This model contains $n_i(n_i + 1)/2$ parameters
per individual, which is a large number if $n_i$ is large. If one has a common design
across individuals, it may be plausible to fit this model, but one would still need a
large number of individuals $m$, in order for inference to be reliable. As usual, there
is a trade-off between flexibility and parsimony.


8.4.3  Parameter Interpretation for Linear Mixed Models

In this section we discuss how $\beta$ and $b$ may be interpreted in the LMM; this
interpretation requires care, as we illustrate in the context of a longitudinal study
with both random intercepts and random slopes. For a generic individual at time $t$,
suppose the model is

$$E[Y \mid b, t] = (\beta_0 + b_0) + (\beta_1 + b_1)(t - \bar{t})$$

with $b = [b_0, b_1]^T$. The marginal model is

$$E[Y \mid t] = \beta_0 + \beta_1 (t - \bar{t}),$$

so that $\beta_0$ is the expected response at $t = \bar{t}$ and the slope parameter $\beta_1$ is the
expected change in response for a unit increase in time. These expectations are with
respect to the distribution of random effects and are averages across the population
of individuals.

For a generic individual, $\beta_0 + b_0$ is the expected response at $t = \bar{t}$, and $\beta_1 + b_1$
is the expected change in response for a unit increase in time. In a linear model,
$\beta_1$ is also the average of the individual slopes, $\beta_1 + b_1$. Consequently, since the
model is linear, $\beta_1$ is both the expected change in the average response in unit time
(across individuals) and the average of the individual expected changes in unit time.
An alternative interpretation is that $\beta_1$ is the change in response for a unit change
in $t$ for a “typical” individual, that is, an individual with $b_1 = 0$. In Chap. 9 we will
illustrate  how  the  interpretation  of  parameters  in  mixed  models  becomes  far  more 
complex  when  the  model  is  nonlinear  in  the  parameters,  and  we  will  see  that  the 
consideration  of  a  typical  individual  is  particularly  useful  in  this  case. 


8.5  Likelihood  Inference  for  Linear  Mixed  Models 

We  now  turn  to  inference  and  first  consider  likelihood  methods  for  the  LMM 

$$y_i = x_i\beta + z_i b_i + \epsilon_i.$$

To implement a likelihood approach, we need to specify a complete probability
distribution for the data, and this follows by specifying distributions for $\epsilon_i$ and $b_i$,
$i = 1, \dots, m$. A common choice is $\epsilon_i \mid \sigma_\epsilon^2 \sim_{ind} N_{n_i}(0, \sigma_\epsilon^2 I_{n_i})$ and $b_i \mid D \sim_{iid}
N_{q+1}(0, D)$, where $D$ is a $(q + 1) \times (q + 1)$ covariance matrix,
so that the vector of variance-covariance parameters is $\alpha = [\sigma_\epsilon^2, D]$. The marginal
mean and variance are

$$E[Y_i \mid \beta] = \mu_i(\beta) = x_i\beta, \qquad (8.15)$$

$$\text{var}(Y_i \mid \alpha) = V_i(\alpha) = z_i D z_i^T + \sigma_\epsilon^2 I_{n_i}. \qquad (8.16)$$

We have refined notation in this section to explicitly condition on the relevant
parameters. In general, inference may be required for the fixed effects regression
parameters $\beta$, the variance components $\alpha$, or the random effects, $b = [b_1, \dots, b_m]^T$.

We  consider  each  of  these  possibilities  in  turn. 


8.5.1  Inference  for  Fixed  Effects 

Likelihood methods have traditionally been applied to nonrandom parameters, and
so, we integrate over the random effects in the two-stage model to give

$$p(y \mid \beta, \alpha) = \int_b p(y \mid b, \beta, \alpha) \times p(b \mid \beta, \alpha)\, db.$$

Exploiting conditional independencies, we obtain the simplified form

$$p(y \mid \beta, \alpha) = \prod_{i=1}^{m} \int_{b_i} p(y_i \mid b_i, \beta, \sigma_\epsilon^2) \times p(b_i \mid D)\, db_i,$$

and since a convolution of normals is normal, we obtain

$$y_i \mid \beta, \alpha \sim N_{n_i}[\mu_i(\beta), V_i(\alpha)],$$

where the marginal mean $\mu_i(\beta)$ and variance $V_i(\alpha)$ correspond to (8.15) and (8.16),
respectively. The log-likelihood is

$$l(\beta, \alpha) = -\frac{1}{2} \sum_{i=1}^{m} \log |V_i(\alpha)| - \frac{1}{2} \sum_{i=1}^{m} (y_i - x_i\beta)^T V_i(\alpha)^{-1} (y_i - x_i\beta). \qquad (8.17)$$

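The MLEs for $\beta$ and $\alpha$ are obtained via maximization of (8.17). The score equations
for $\beta$ are

$$\frac{\partial l}{\partial \beta} = \sum_{i=1}^m x_i^T V_i^{-1} Y_i - \sum_{i=1}^m x_i^T V_i^{-1} x_i \beta = \sum_{i=1}^m x_i^T V_i^{-1} (Y_i - x_i \beta) \qquad (8.18)$$

and yield the MLE

$$\hat\beta = \left( \sum_{i=1}^m x_i^T V_i(\hat\alpha)^{-1} x_i \right)^{-1} \sum_{i=1}^m x_i^T V_i(\hat\alpha)^{-1} y_i, \qquad (8.19)$$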


which is a generalized least squares estimator. If $D = 0$, then $V = \sigma_\epsilon^2 I_N$ (where
$N = \sum_{i=1}^m n_i$), and $\hat\beta$ corresponds to the ordinary least squares estimator, as we
would expect. The variance of $\hat\beta$ may be obtained either directly from (8.19), since
the estimator is linear in $y_i$, or from the second derivative of the log-likelihood.
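
The estimator (8.19) can be sketched in a few lines for known $V_i$. The function name `gls_estimate` and the simulated data are assumptions made for illustration; the check at the end uses the special case $D = 0$ noted above, in which generalized and ordinary least squares agree.

```python
import numpy as np

def gls_estimate(x_list, y_list, V_list):
    """GLS estimator (8.19) and its model-based variance, for known V_i."""
    p = x_list[0].shape[1]
    info = np.zeros((p, p))   # accumulates sum_i x_i^T V_i^{-1} x_i
    rhs = np.zeros(p)         # accumulates sum_i x_i^T V_i^{-1} y_i
    for x_i, y_i, V_i in zip(x_list, y_list, V_list):
        Vinv_x = np.linalg.solve(V_i, x_i)   # V_i^{-1} x_i (V_i is symmetric)
        info += x_i.T @ Vinv_x
        rhs += Vinv_x.T @ y_i
    beta_hat = np.linalg.solve(info, rhs)
    return beta_hat, np.linalg.inv(info)

# Simulated illustration: 5 units, 4 observations each, D = 0.
rng = np.random.default_rng(0)
t = np.array([-2.0, -1.0, 1.0, 2.0])
x_i = np.column_stack([np.ones_like(t), t])
x_list = [x_i] * 5
y_list = [x_i @ np.array([25.0, 0.8]) + rng.normal(size=4) for _ in range(5)]
V_list = [2.0 * np.eye(4)] * 5               # V_i = sigma_eps^2 I with sigma_eps^2 = 2

beta_gls, var_gls = gls_estimate(x_list, y_list, V_list)
beta_ols = np.linalg.lstsq(np.vstack(x_list), np.concatenate(y_list), rcond=None)[0]
```

In practice $V_i$ depends on the unknown $\alpha$, so a routine of this kind would be called with $V_i(\hat\alpha)$ inside an iterative fit.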

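The expected information matrix is block diagonal:

$$I(\beta, \alpha) = \begin{bmatrix} I_{\beta\beta} & 0 \\ 0 & I_{\alpha\alpha} \end{bmatrix}, \qquad (8.20)$$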


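so there is asymptotic independence between $\hat\beta$ and $\hat\alpha$, and any consistent estimator
of $\alpha$ will give an asymptotically efficient estimator for $\beta$ (likelihood-based estimation
of $\alpha$ is considered in Sects. 8.5.2 and 8.5.3). Since

$$-\frac{\partial^2 l}{\partial\beta \, \partial\beta^T} = \sum_{i=1}^m x_i^T V_i(\alpha)^{-1} x_i \qquad (8.21)$$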


the observed and expected information matrices coincide. The estimator $\hat\beta$ is linear
in the data $Y_i$, and so under normality of the data, $\hat\beta$ is normal also. Under correct
specification of the variance model, and with a consistent estimator $\hat\alpha$,

$$\left( \sum_{i=1}^m x_i^T V_i(\hat\alpha)^{-1} x_i \right)^{1/2} (\hat\beta - \beta) \to_d N_{k+1}(0, I_{k+1}),$$

as $m \to \infty$. Since $\hat\beta$ is linear in $Y$, it follows immediately that this asymptotic
distribution is also appropriate when the data and random effects are not normal.
We require the second moments of the data to be correctly specified, however. In
Sect. 8.7 we describe how a consistent variance estimator may be obtained when
$\mathrm{cov}(Y_i, Y_{i'} \mid \alpha) = 0$, but $\mathrm{var}(Y_i \mid \alpha) = V_i(\alpha)$ is not necessarily correctly specified.

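In terms of the asymptotics it is not sufficient to have $m$ fixed and $n_i \to \infty$
for $i = 1, \ldots, m$. We illustrate for the LMM with $z_i = x_i$, in which case $V_i = x_i D x_i^T + \sigma_\epsilon^2 I_{n_i}$. Under this setup,

$$\mathrm{var}(\hat\beta) = \left( \sum_{i=1}^m x_i^T V_i^{-1} x_i \right)^{-1} = \left( \sum_{i=1}^m \left[ (x_i^T x_i)^{-1} \sigma_\epsilon^2 + D \right]^{-1} \right)^{-1},$$

where we have used the matrix identity $x_i^T V_i^{-1} x_i = [(x_i^T x_i)^{-1} \sigma_\epsilon^2 + D]^{-1}$ (which
may be derived from (B.3) of Appendix B). When $n_i \to \infty$,

$$(x_i^T x_i)^{-1} = O(n_i^{-1}) \to 0,$$

and if $m$ is fixed,

$$\mathrm{var}(\hat\beta) \to \frac{D}{m},$$

showing that we require $m \to \infty$ for consistency of $\hat\beta$.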
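
A small numerical check can make this limit concrete. The sketch below assumes a hypothetical design (an intercept and equally spaced times on $[-1, 1]$) and arbitrary values for $D$ and $\sigma_\epsilon^2$; it verifies the matrix identity for one unit and then tracks how far $\mathrm{var}(\hat\beta)$ is from $D/m$ as $n_i$ grows with $m = 3$ held fixed.

```python
import numpy as np

def var_beta_hat(x_list, D, sigma2_eps):
    """[sum_i x_i^T V_i^{-1} x_i]^{-1} with V_i = x_i D x_i^T + sigma_eps^2 I."""
    p = D.shape[0]
    info = np.zeros((p, p))
    for x_i in x_list:
        V_i = x_i @ D @ x_i.T + sigma2_eps * np.eye(x_i.shape[0])
        info += x_i.T @ np.linalg.solve(V_i, x_i)
    return np.linalg.inv(info)

def design(n):
    # Hypothetical design: intercept plus n equally spaced times on [-1, 1].
    t = np.linspace(-1.0, 1.0, n)
    return np.column_stack([np.ones(n), t])

D = np.array([[1.0, 0.2],
              [0.2, 0.5]])     # assumed random effects covariance
sigma2_eps = 1.0               # assumed measurement error variance
m = 3                          # number of units, held fixed

# Matrix identity for a single unit with n_i = 6.
x = design(6)
V = x @ D @ x.T + sigma2_eps * np.eye(6)
lhs = x.T @ np.linalg.solve(V, x)
rhs = np.linalg.inv(np.linalg.inv(x.T @ x) * sigma2_eps + D)

# Largest absolute difference between var(beta_hat) and D/m as n_i grows.
gaps = [np.abs(var_beta_hat([design(n)] * m, D, sigma2_eps) - D / m).max()
        for n in (10, 100, 1000)]
```

The entries of `gaps` shrink toward zero, so the variance settles at $D/m$ and not at zero: more observations per unit cannot compensate for a small number of units.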

Likelihood ratio tests can be used to test hypotheses concerning elements of $\beta$,
for fixed $\alpha$ or, in practice, the substitution of an estimate $\hat\alpha$. Various $t$ and $F$-like
approaches have been suggested for correcting for the estimation of $\alpha$, see Verbeke
and Molenberghs (2000, Chap. 6), but if the sample size $m$ is not sufficiently large
for reliable estimation of $\alpha$, we recommend resampling methods, or following a
Bayesian approach to inference, since this produces inference for $\beta$ that averages
over the uncertainty in the estimation of $\alpha$.

For more complex linear models, inference may not be so straightforward. For
example, consider the model

$$Y_{ij} = x_{ij}\beta + z_{ij} b_i + \epsilon_{ij} = \mu_{ij} + \epsilon_{ij}$$

but with nonconstant measurement error variance. A common model is $\mathrm{var}(\epsilon_{ij}) = \sigma_\epsilon^2 \mu_{ij}^\gamma$, for known $\gamma > 0$. In this case the MLE for $\beta$ is not available in closed form,
and we do not have a block-diagonal information matrix as in (8.20). An example of the
fitting of such a model in a nonlinear setting is given at the end of Sect. 9.20.

Maximum likelihood estimation is also theoretically straightforward for the
extended model (8.11) in which we have a richer variance model, but identifiability
issues may arise due to the complexity of the error structure.


8.5.2  Inference  for  Variance  Components  via  Maximum 
Likelihood 

The MLE $\hat\alpha$ is obtained from maximization of (8.17), but in general, there is no
closed-form solution. However, the expectation-maximization (EM, Dempster et al.
1977) or Newton-Raphson algorithm may be applied to the profile likelihood:

$$l_p(\alpha) = \max_\beta l(\beta, \alpha) = -\frac{1}{2} \log |V(\alpha)| - \frac{1}{2} (y - x\hat\beta)^T V(\alpha)^{-1} (y - x\hat\beta),$$

since recall from Sect. 2.4.2 that the MLE for $\alpha$ is identical to the estimate obtained
from the profile likelihood. Under standard likelihood theory,

$$I_{\alpha\alpha}^{1/2} (\hat\alpha - \alpha) \to_d N_r(0, I_r),$$

where $r$ is the number of distinct elements of $\alpha$. This distribution provides
asymptotic confidence intervals for elements of $\alpha$.

Testing whether random effect variances are zero requires care since the null
hypothesis lies on the boundary, and so, the usual regularity conditions are not
satisfied. We illustrate by considering the model

$$Y_{ij} = \beta_0 + x_{ij}\beta + b_i + \epsilon_{ij}$$

with $b_i \mid \sigma_0^2 \sim N(0, \sigma_0^2)$. Suppose we wish to test whether the random effects
variance is zero, that is, $H_0 : \sigma_0^2 = 0$ versus $H_1 : \sigma_0^2 > 0$. In this case, the
asymptotic null distribution is a 50:50 mixture of $\chi_0^2$ and $\chi_1^2$ distributions, where the
former is the distribution that gives probability mass 1 to the value 0. For example,
the 95% points of a $\chi_1^2$ and the 50:50 mixture are 3.84 and 2.71, respectively.
Consequently, if the usual $\chi_1^2$ distribution is used, the null will be accepted too
often, leading to a variance component structure that is too simple.

The intuition behind the form of the null distribution is the following. Estimating
$\sigma_0^2$ is equivalent to estimating $\rho = \sigma_0^2/\sigma^2$ (with $\sigma^2 = \sigma_0^2 + \sigma_\epsilon^2$ the marginal variance) and setting $\hat\rho = 0$ if the estimated
correlation is negative, and under the null, this will happen half the time. If $\hat\rho = 0$,
then we recover the null for the distribution of the data, and so, the likelihood ratio
will be 1, that is, the likelihood ratio statistic will be 0. This gives the mass at the value 0, and combining with the usual $\chi_1^2$
distribution gives the 50:50 mixture.

If $H_0$ and $H_1$ correspond to models with $k$ and $k+1$ random effects, respectively,
each with general covariance structures, then the asymptotic distribution is a 50:50
mixture of $\chi_k^2$ and $\chi_{k+1}^2$ distributions. Hence, for example, if we wish to test random
intercepts only versus correlated random intercepts and random slopes (with $D$
having elements $\sigma_0^2$, $\sigma_{01}$, $\sigma_1^2$), then the distribution of the likelihood ratio statistic is
a 50:50 mixture of $\chi_1^2$ and $\chi_2^2$ distributions. Similar asymptotic results are available
for more complex models/hypotheses; see, for example, Verbeke and Molenberghs
(2000).
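
The mixture reference distributions above are easy to evaluate directly. The following sketch uses only closed-form tail probabilities for 0, 1, and 2 degrees of freedom, which covers the two cases discussed; the function names are our own, and larger $k$ would need a general chi-squared routine.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi^2_df > x) for df in {0, 1, 2}, via closed forms."""
    if x <= 0:
        return 1.0
    if df == 0:
        return 0.0                               # point mass at zero
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))     # P(Z^2 > x)
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 0, 1, 2 are implemented in this sketch")

def mixture_pvalue(lrt, k):
    """p-value under the 50:50 mixture of chi^2_k and chi^2_{k+1}."""
    return 0.5 * chi2_sf(lrt, k) + 0.5 * chi2_sf(lrt, k + 1)

# Tail areas at the two 95% points quoted in the text.
p_naive_at_384 = chi2_sf(3.84, 1)           # usual chi^2_1 reference
p_mixture_at_271 = mixture_pvalue(2.71, 0)  # mixture of chi^2_0 and chi^2_1
```

For the test of a single variance component ($k = 0$), the mixture p-value is exactly half the usual $\chi_1^2$ p-value whenever the statistic is positive, which is why the naive reference distribution is conservative.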


8.5.3  Inference  for  Variance  Components  via  Restricted 
Maximum  Likelihood 

While MLE for variance components yields consistent estimates under correct
model specification, the estimation of $\beta$ is not acknowledged, in the sense that
inference proceeds as if $\beta$ were known. We have already encountered this in
Sect. 2.4.2 for the simple linear model where it was shown that the MLE of $\sigma^2$
is RSS/$n$, while the unbiased version is RSS/$(n - k - 1)$, where RSS is the
residual sum of squares and $k$ is the number of covariates. An alternative, and often
preferable, method that acknowledges estimation of $\beta$ is provided by restricted
(or residual) maximum likelihood (REML). We provide a Bayesian justification
for REML in Sect. 8.6 and here provide another derivation based on marginal
likelihood.

Recall the definition of marginal likelihood from Sect. 2.4.2. Let $S_1$, $S_2$ be
minimal sufficient statistics and suppose

$$p(y \mid \lambda, \phi) \propto p(s_1, s_2 \mid \lambda, \phi) = p(s_1 \mid \lambda) \, p(s_2 \mid s_1, \lambda, \phi) \qquad (8.22)$$

where $\lambda$ represents the parameters of interest and $\phi$ the remaining (nuisance)
parameters. Inference for $\lambda$ may be based on the marginal likelihood $L_m(\lambda) = p(s_1 \mid \lambda)$. We discuss how marginal likelihoods may be derived for general LMMs.

To derive a marginal likelihood, we need to find a function of the data, $U = f(Y)$, whose distribution does not depend upon $\beta$. We briefly digress to discuss
an error contrast, $C^T Y$, which is defined by the property that $E[C^T Y] = 0$ for all
values of $\beta$, with $C$ an $N$-dimensional vector. For the LMM,

$$E[C^T Y] = 0 \text{ for all } \beta \text{ if and only if } C^T x = 0.$$

When $C^T x = 0$,

$$C^T Y = C^T z b + C^T \epsilon,$$

which does not depend on $\beta$, suggesting that the marginal likelihood could be
based on error contrasts. If $x$ is of full rank, that is, is of rank $k + 1$, there
are exactly $N - k - 1$ linearly independent error contrasts (since $k + 1$ fixed
effects have been estimated, which induces dependencies in the error contrasts).
Let $B = [C_1, \ldots, C_{N-k-1}]$ denote an error contrast matrix. Given two error
contrast matrices $B_1$ and $B_2$, it can be shown that there exists a full rank,
$(N - k - 1) \times (N - k - 1)$ matrix $A$ such that $B_1^T = A B_2^T$. Therefore, likelihoods
based on $B_1^T Y$ or on $B_2^T Y$ will be proportional, and estimators based on either will
be identical. Let $H = x (x^T x)^{-1} x^T$, and choose $B$ such that $I - H = B B^T$ and
$I = B^T B$. It is easily shown that $B$ is an error contrast matrix since

$$B^T x = B^T B B^T x = B^T (I - H) x = 0.$$

The function of the data we consider is therefore $U = B^T Y$ which may be
written as

$$U = B^T Y = B^T B B^T Y = B^T (I - H) Y = B^T r,$$

where $r = Y - x\hat\beta_O$, and $\hat\beta_O = (x^T x)^{-1} x^T Y$ is the OLS estimator, showing that
$B^T Y$ is a linear combination of residuals (hence the name "residual" maximum
likelihood). Since $B^T x = 0$, we can confirm that

$$U = B^T Y = B^T z b + B^T \epsilon,$$

with $E[U] = 0$. Further, the distribution of $U$ does not depend upon $\beta$, as required
for a marginal likelihood.

We now derive the distribution of $U$ by considering the transformation from
$Y \to [U, \hat\beta_G] = [B^T Y, G^T Y]$, where

$$\hat\beta_G = G^T Y = (x^T V^{-1} x)^{-1} x^T V^{-1} Y$$

is the generalized least squares (GLS) estimator. We derive the Jacobian of the
transformation, using (B.1) and (B.2) in Appendix B:

$$\begin{aligned}
|J| = \left| \frac{\partial (U, \hat\beta_G)}{\partial Y} \right| &= |[B \;\; G]| = \left| [B \;\; G]^T [B \;\; G] \right|^{1/2} \\
&= |B^T B|^{1/2} \, |G^T G - G^T B (B^T B)^{-1} B^T G|^{1/2} \\
&= 1 \times |G^T G - G^T (I - H) G|^{1/2} \\
&= |G^T H G|^{1/2} = |x^T x|^{-1/2} \neq 0
\end{aligned}$$

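which implies that the transformation to $[U, \hat\beta_G]$ is of full rank (equal to $N$). The vector $[U, \hat\beta_G]$ is a
linear combination of normals and so is normal, and

$$\begin{aligned}
\mathrm{cov}(U, \hat\beta_G) &= E[U (\hat\beta_G - \beta)^T] \\
&= E[B^T Y Y^T G] - E[B^T Y] \beta^T \\
&= B^T [\mathrm{var}(Y) + E(Y) E(Y)^T] G - B^T x \beta \beta^T \\
&= B^T V G + B^T x \beta (x\beta)^T G - B^T x \beta \beta^T \\
&= 0,
\end{aligned}$$

where we have repeatedly used $B^T x = 0$ and $V = \mathrm{var}(Y)$; in particular, $B^T V G = B^T x (x^T V^{-1} x)^{-1} = 0$. So $U$ and $\hat\beta_G$ are
uncorrelated and, since they are normal, independent also. Consequently,

$$\begin{aligned}
p(y \mid \alpha, \beta) &= p(U, \hat\beta_G \mid \alpha, \beta) \, |J| \\
&= p(U \mid \hat\beta_G, \alpha, \beta) \, p(\hat\beta_G \mid \alpha, \beta) \, |J| \\
&= p(U \mid \alpha) \, p(\hat\beta_G \mid \alpha, \beta) \, |J|. \qquad (8.23)
\end{aligned}$$

By comparison with (8.22), we have $s_1 = U$, $s_2 = \hat\beta_G$, $\lambda = \alpha$, and $\phi = \beta$, and
$p(U \mid \alpha)$ is a marginal likelihood. Rearrangement of (8.23) gives

$$p(U \mid \alpha) = \frac{p(y \mid \alpha, \beta) \, |J|^{-1}}{p(\hat\beta_G \mid \alpha, \beta)}.$$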


Since

$$p(y \mid \alpha, \beta) = (2\pi)^{-N/2} |V|^{-1/2} \exp\left\{ -\frac{1}{2} (y - x\beta)^T V^{-1} (y - x\beta) \right\}$$

and

$$p(\hat\beta_G \mid \alpha, \beta) = (2\pi)^{-(k+1)/2} |x^T V^{-1} x|^{1/2} \exp\left\{ -\frac{1}{2} (\hat\beta_G - \beta)^T x^T V^{-1} x (\hat\beta_G - \beta) \right\},$$

we obtain the marginal likelihood

$$p(U \mid \alpha) = c \, \frac{|x^T x|^{1/2} \, |V|^{-1/2}}{|x^T V^{-1} x|^{1/2}} \exp\left\{ -\frac{1}{2} (y - x\hat\beta_G)^T V^{-1} (y - x\hat\beta_G) \right\}$$

with $c = (2\pi)^{-(N-k-1)/2}$, which (as already mentioned) does not depend upon $B$.
Hence, we can choose any linearly independent combination of the residuals.

The restricted log-likelihood upon which inference for $\alpha$ may be based is

$$l_R(\alpha) = -\frac{1}{2} \log |x^T V(\alpha)^{-1} x| - \frac{1}{2} \log |V(\alpha)| - \frac{1}{2} (y - x\hat\beta_G)^T V(\alpha)^{-1} (y - x\hat\beta_G).$$

Comparison with the profile log-likelihood for $\alpha$,

$$l_P(\alpha) = -\frac{1}{2} \log |V(\alpha)| - \frac{1}{2} (y - x\hat\beta_G)^T V(\alpha)^{-1} (y - x\hat\beta_G),$$

shows that we have an additional term, $-\frac{1}{2} \log |x^T V(\alpha)^{-1} x|$, that may be viewed
as accounting for the degrees of freedom lost in estimation of $\beta$. Computationally,
finding REML estimators is as straightforward as their ML counterparts, as the
objective functions differ simply by a single term. Both ML and REML estimates
may be obtained using EM or Newton-Raphson algorithms; see Pinheiro and Bates
(2000) for details.
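
To see the single-term difference at work, the sketch below evaluates both objective functions for a given $V(\alpha)$ and then maximizes each by a crude grid search in the simplest setting, $V = \sigma^2 I$, where the two maximizers should be close to RSS/$N$ and RSS/$(N - k - 1)$, respectively. The function name, the simulated data, and the grid are assumptions; the grid search stands in for the EM or Newton-Raphson algorithms mentioned above.

```python
import numpy as np

def profile_and_restricted_loglik(y, x, V):
    """Return (l_P, l_R), up to additive constants, for a given V = V(alpha)."""
    Vinv_x = np.linalg.solve(V, x)
    xtVinvx = x.T @ Vinv_x                               # x^T V^{-1} x
    beta_g = np.linalg.solve(xtVinvx, Vinv_x.T @ y)      # GLS estimator
    resid = y - x @ beta_g
    quad = resid @ np.linalg.solve(V, resid)
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_xtVinvx = np.linalg.slogdet(xtVinvx)
    l_p = -0.5 * logdet_V - 0.5 * quad
    l_r = l_p - 0.5 * logdet_xtVinvx                     # the extra REML term
    return l_p, l_r

# Simulated simple linear model with N = 12 and k + 1 = 2 regression parameters.
rng = np.random.default_rng(1)
N = 12
x = np.column_stack([np.ones(N), np.arange(N, dtype=float)])
y = x @ np.array([1.0, 0.5]) + rng.normal(size=N)
ols_resid = y - x @ np.linalg.lstsq(x, y, rcond=None)[0]
rss = ols_resid @ ols_resid

# Grid search over sigma^2 with V = sigma^2 I.
grid = np.linspace(0.05, 5.0, 2000)
values = np.array([profile_and_restricted_loglik(y, x, s2 * np.eye(N)) for s2 in grid])
sigma2_ml = grid[values[:, 0].argmax()]
sigma2_reml = grid[values[:, 1].argmax()]
```

Up to the grid resolution, `sigma2_ml` matches `rss / N` and `sigma2_reml` matches `rss / (N - 2)`, which illustrates the degrees-of-freedom interpretation of the additional term.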

In general, REML estimators have finite sample bias, but they are less biased than
ML estimators, particularly for small samples. As far as estimation of the variance
components is concerned, the asymptotic distribution of the REML estimator is
normal, with variance given by the inverse of Fisher's information matrix, where
the latter is based on $l_R(\alpha)$.

REML is effectively based on a likelihood with data constructed from the
distribution of the residuals $y - x\hat\beta_G$. Therefore, when two regression models
are to be compared, the data under the two models are different; hence, REML
likelihood ratio tests for elements of $\beta$ cannot be performed. Consequently, when
a likelihood ratio test is required to formally compare two nested regression models,
maximum likelihood must be used to fit the models. Likelihood ratio tests for
variance components are valid under restricted maximum likelihood, however, since
the covariates, and hence residuals, are constant in both models.


Example: One-Way ANOVA

The simplest example of a LMM is the balanced one-way random effects ANOVA
model:

$$Y_{ij} = \beta_0 + b_i + \epsilon_{ij},$$

with $b_i$ and $\epsilon_{ij}$ independent and distributed as $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$ and $\epsilon_{ij} \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2)$, with $n$ observations on each unit and $i = 1, \ldots, m$ to give $N = nm$ observations in total. In this example, $\beta = \beta_0$ and $\alpha = [\sigma_\epsilon^2, \sigma_0^2]$. This model
was considered briefly in Sect. 5.8.4.

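The model can be written in the form of (8.7) as

$$y_i = 1_n \beta_0 + 1_n b_i + \epsilon_i$$

where $y_i = [y_{i1}, \ldots, y_{in}]^T$ and $\epsilon_i = [\epsilon_{i1}, \ldots, \epsilon_{in}]^T$. Marginally, this specification
implies that the data are normal with $E[Y \mid \beta] = 1_N \beta_0$ and $\mathrm{var}(Y \mid \alpha) = \mathrm{diag}(V_1, \ldots, V_m)$ where

$$V_i = 1_n 1_n^T \sigma_0^2 + I_n \sigma_\epsilon^2,$$

for $i = 1, \ldots, m$. In the case of $n = 3$ observations per unit, this yields the $N \times N$
marginal variance, which is block diagonal with $m$ identical $3 \times 3$ blocks:

$$\mathrm{var}(Y \mid \alpha) = \sigma^2 \, \mathrm{diag}(R, \ldots, R), \qquad R = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix},$$

where $\sigma^2 = \sigma_0^2 + \sigma_\epsilon^2$ is the marginal variance of each observation, and

$$\rho = \frac{\sigma_0^2}{\sigma^2}$$

is the marginal correlation between two observations on the same unit. The
correlation, $\rho$, is induced by the shared random effect and is referred to as the intraclass
correlation coefficient.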

For some data/mixed effects model combinations, there are more combined fixed
and random effects than data points, which is at first sight disconcerting, but the
random effects have a special status since they are tied together through a common
distribution. In the above ANOVA model, we have $m + 3$ unknown quantities if
we include the random effects, but these random effects may be integrated from
the model so that the distribution of the data may be written in terms of the three
parameters, $[\beta_0, \sigma_0^2, \sigma_\epsilon^2]$ only, without reference to the random effects, that is,

$$Y \mid \beta_0, \sigma_0^2, \sigma_\epsilon^2 \sim N_N[1 \beta_0, V(\sigma_0^2, \sigma_\epsilon^2)].$$

A fixed effects model with a separate parameter for each group has $m + 1$ parameters,
which shows that the mixed effects model can offer a parsimonious description.

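The MLE for $\beta_0$ is given by the grand mean, i.e., $\hat\beta_0 = \bar{Y}_{..}$. With balanced data
the ML and REML estimators for the variance components are available in closed
form (see Exercise 8.2). We define the between- and within-group mean squares as

$$\mathrm{MSA} = \frac{n \sum_{i=1}^m (\bar{y}_{i.} - \bar{y}_{..})^2}{m - 1}, \qquad \mathrm{MSE} = \frac{\sum_{i=1}^m \sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2}{m(n - 1)}.$$

The MLEs of the variance components are

$$\hat\sigma_\epsilon^2 = \mathrm{MSE}, \qquad \hat\sigma_0^2 = \max\left[ 0, \frac{(1 - 1/m) \, \mathrm{MSA} - \mathrm{MSE}}{n} \right].$$

The REML estimate for $\sigma_\epsilon^2$ is the same as the MLE, but the REML estimate for $\sigma_0^2$ is

$$\hat\sigma_0^2 = \max\left[ 0, \frac{\mathrm{MSA} - \mathrm{MSE}}{n} \right],$$

which is slightly larger than the ML estimate, having accounted for the estimation
of $\beta_0$. Notice that the ML and REML estimators for $\sigma_0^2$ may be zero.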
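
These closed-form estimators are simple to compute. The sketch below assumes the data are arranged as an $m \times n$ array (one row per unit); the function name and the two small data sets are made up for illustration, the second being chosen so that both variance estimates are truncated at zero.

```python
import numpy as np

def one_way_variance_components(y):
    """ML and REML estimates for the balanced one-way random effects model.

    y is an m x n array: m units with n observations each.
    """
    y = np.asarray(y, dtype=float)
    m, n = y.shape
    unit_means = y.mean(axis=1)
    grand_mean = y.mean()
    msa = n * np.sum((unit_means - grand_mean) ** 2) / (m - 1)     # between-group
    mse = np.sum((y - unit_means[:, None]) ** 2) / (m * (n - 1))   # within-group
    return {
        "beta0": grand_mean,
        "sigma2_eps": mse,
        "sigma2_0_ml": max(0.0, ((1 - 1 / m) * msa - mse) / n),
        "sigma2_0_reml": max(0.0, (msa - mse) / n),
    }

# Made-up data with clear between-unit differences (m = 4, n = 3).
fit = one_way_variance_components([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Made-up data with identical unit means: both estimates of sigma_0^2 are zero.
fit_zero = one_way_variance_components([[1, 5], [5, 1]])
```

In the first data set the REML estimate of $\sigma_0^2$ exceeds the ML estimate, as the formulas require whenever MSA is positive and neither estimate is truncated.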


Example:  Dental  Growth  Curves 

We consider the full data and fit a model with distinct fixed effects (intercepts and
slopes) for boys and girls and with random intercepts and slopes but with a common
random effects distribution for boys and girls. Specifically, at stage one,

$$Y_{ij} = (\beta_0 + b_{i0}) + (\beta_1 + b_{i1}) t_j + \epsilon_{ij}$$

for boys, $i = 1, \ldots, 16$, and

$$Y_{ij} = (\beta_0 + \beta_2 + b_{i0}) + (\beta_1 + \beta_3 + b_{i1}) t_j + \epsilon_{ij}$$

for girls, $i = 17, \ldots, 27$. At stage two,

$$b_i = \begin{bmatrix} b_{i0} \\ b_{i1} \end{bmatrix} \mid D \sim_{iid} N_2(0, D), \qquad D = \begin{bmatrix} D_{00} & D_{01} \\ D_{01} & D_{11} \end{bmatrix} = \begin{bmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{bmatrix},$$

for $i = 1, \ldots, 27$. We take $[t_1, t_2, t_3, t_4] = [-2, -1, 1, 2]$ so that we have centered
by the average age of 11 years. In the generic notation introduced in Sect. 8.4, the
above model translates to

$$Y_{ij} = x_{ij}\beta + z_{ij} b_i + \epsilon_{ij}$$




where $\beta = [\beta_0, \beta_1, \beta_2, \beta_3]^T$, and the design matrices for the fixed and random
effects are

$$x_{ij} = \begin{cases} [1, t_j, 0, 0] & \text{for } i = 1, \ldots, 16 \\ [1, t_j, 1, t_j] & \text{for } i = 17, \ldots, 27, \end{cases}$$

and $z_{ij} = [1, t_j]$, where $j = 1, 2, 3, 4$. Therefore, $\beta_0$ is the average tooth length
at 11 years for boys, $\beta_1$ is the slope for boys (specifically the average change in
tooth length between two populations of boys whose ages differ by 1 year), $\beta_2$ is
the difference between the average tooth lengths of girls and boys at 11 years, and
$\beta_3$ is the average difference in slopes between girls and boys. The intercept random
effects $b_{i0}$ may be viewed as the accumulation of all unmeasured variables that
contribute to the tooth length for child $i$ differing from the relevant (boy or girl)
population average length (measured at 11 years). The slope random effects $b_{i1}$ are
the child by time interaction terms and summarize all of the unmeasured variables
for child $i$ that lead to the rate of change in growth for this child differing from the
relevant (boy or girl) population average.

Fitting this model via REML yields

$$\hat\beta = [25, 0.78, -2.3, -0.31]^T$$

with standard errors

$$[0.49, 0.086, 0.76, 0.14].$$

The asymptotic 95% confidence interval for the average difference in tooth lengths
at 11 years is $[-3.8, -0.83]$, from which we conclude that the average tooth length
at 11 years is greater for boys than for girls. The 95% interval for the slope difference
is $[-0.57, -0.04]$, suggesting that the average rate of growth is greater for boys also.

There are a number of options to test whether gender-specific slopes are required,
that is, to decide on whether $\beta_3 = 0$. A Wald test using the REML estimates gives
a p-value of 0.026 (so that one endpoint of a 97.4% confidence interval is zero),
which conventionally would suggest a difference in slopes. To perform a likelihood
ratio test, we need to carry out a fit using ML, since REML is not valid, as explained
in Sect. 8.5.3. Fitting the models with and without distinct slopes gives a change
in twice the log-likelihood of 5.03, with an associated p-value of 0.036, which is
consistent with the Wald test. Hence, there is reason to believe that the slopes for
boys and girls are unequal, with the increase in the average growth over 1 year being
estimated as 0.3 mm greater for boys than for girls.

The estimated variance-covariance matrices of the random effects, $\hat{D}$, under
REML and ML are

$$\begin{bmatrix} 1.84^2 & 0.21 \times 1.84 \times 0.18 \\ 0.21 \times 1.84 \times 0.18 & 0.18^2 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 1.75^2 & 0.23 \times 1.75 \times 0.15 \\ 0.23 \times 1.75 \times 0.15 & 0.15^2 \end{bmatrix},$$

so that, as expected, the REML estimates are slightly larger. Although $\hat\beta$ depends
on $\hat{D}$, the point estimates of $\beta$ are identical under ML and REML here, because of
the balanced design. The standard errors for elements of $\hat\beta$ are slightly larger under
REML, due to the differences in $\hat{V}$.

Under REML, the estimated standard deviations of the distributions of the
intercepts and slopes are $\hat\sigma_0 = 1.84$ and $\hat\sigma_1 = 0.18$, respectively. Whether these
are "small" or "not small" relates to the scale of the variables with which they
are associated. Interpretation of elements of $D$ depends, in general, on how we
parameterize the time variable. For example, if we changed the time scale via a
location shift, we would change the definition of the intercept. As parameterized
above, the off-diagonal term $D_{01}$ describes the covariance between the child-specific
responses at 11 years and the child-specific slopes (the REML estimate
of the correlation between these quantities is 0.21, as in the REML matrix above).

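Suppose we reparameterize stage one of the model as

$$E[Y_{ij} \mid b_i^\star] = (\beta_0^\star + b_{i0}^\star) + (\beta_1 + b_{i1}) t_j$$

with $t = [8, 10, 12, 14]$ and $b_i^\star = [b_{i0}^\star, b_{i1}]^T$. Then $\beta_0^\star = \beta_0 - \beta_1 \bar{t}$,
$b_{i0}^\star = b_{i0} - b_{i1} \bar{t}$, and

$$\begin{aligned}
D_{00}^\star &= D_{00} - 2 \bar{t} D_{01} + \bar{t}^2 D_{11} \\
D_{01}^\star &= D_{01} - \bar{t} D_{11} \\
D_{11}^\star &= D_{11}.
\end{aligned}$$

Consequently, only the interpretation of the variance of the slopes remains unchanged,
when compared with the previous parameterization.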
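
The three relations follow from writing $b_i^\star = A b_i$ for a fixed $2 \times 2$ matrix $A$, so that $D^\star = A D A^T$. The sketch below applies this to the REML estimates quoted earlier, taking $\bar{t} = 11$; the function name is our own and the rounded inputs mean the output is indicative only.

```python
import numpy as np

def shift_time_origin(D, t_bar):
    """Covariance of [b_i0 - t_bar * b_i1, b_i1] given D = var([b_i0, b_i1])."""
    A = np.array([[1.0, -t_bar],
                  [0.0, 1.0]])
    return A @ D @ A.T

# REML estimates quoted in the text (standard deviations 1.84 and 0.18, correlation 0.21).
D00 = 1.84 ** 2
D01 = 0.21 * 1.84 * 0.18
D11 = 0.18 ** 2
D = np.array([[D00, D01],
              [D01, D11]])

t_bar = 11.0                      # average of the ages 8, 10, 12, 14
D_star = shift_time_origin(D, t_bar)

# Correlation between intercepts and slopes under the shifted time origin.
corr_star = D_star[0, 1] / np.sqrt(D_star[0, 0] * D_star[1, 1])
```

Under this shift the intercept refers to age 0, far outside the observed ages, and the sign of the intercept-slope correlation changes from positive to negative, which illustrates how strongly the interpretation of $D$ depends on the time origin.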

We return to the original parameterization and examine further the fitting of
this model. Since we have assumed a common measurement error variance $\sigma_\epsilon^2$,
and common random effects variances $D$ for boys and girls, the implied marginal
standard deviations and correlations are the same for boys and girls and may
be estimated from (8.9) and (8.10). Under REML, $\hat\sigma_\epsilon = 1.31$ and the standard
deviations (on the diagonal) and correlations (on the off-diagonal) are

$$\begin{bmatrix} 2.23 & & & \\ 0.65 & 2.23 & & \\ 0.64 & 0.65 & 2.30 & \\ 0.62 & 0.65 & 0.68 & 2.35 \end{bmatrix}. \qquad (8.24)$$


We see that the standard deviations increase slightly over time, and the correlations
decrease only slightly for observations further apart in time, suggesting that the
random slopes are not contributing greatly to the fit. Fitting a random-intercepts-only
model to these data produced a marginal variance estimate of $2.28^2$ and
common within-child correlations of 0.63.

The empirical standard deviations and correlations for boys and girls are given,
respectively, by

$$\begin{bmatrix} 2.45 & & & \\ 0.44 & 2.14 & & \\ 0.56 & 0.39 & 2.65 & \\ 0.32 & 0.63 & 0.59 & 2.09 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 2.12 & & & \\ 0.83 & 1.90 & & \\ 0.86 & 0.90 & 2.36 & \\ 0.84 & 0.88 & 0.95 & 2.44 \end{bmatrix},$$

which suggests that our model needs refinement, since clearly the correlations for
girls are greater than for boys.
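
The model-implied summaries in (8.24) can be approximately reproduced from the REML estimates quoted in this example. The sketch below assumes the centered times $[-2, -1, 1, 2]$ stated in the text and uses the rounded estimates, so agreement is to about two decimal places only.

```python
import numpy as np

# Centered measurement times as stated in the text, and z_ij = [1, t_j].
t = np.array([-2.0, -1.0, 1.0, 2.0])
z = np.column_stack([np.ones_like(t), t])

# Rounded REML estimates quoted in the text.
sd0, sd1, rho01, sd_eps = 1.84, 0.18, 0.21, 1.31
D = np.array([[sd0 ** 2, rho01 * sd0 * sd1],
              [rho01 * sd0 * sd1, sd1 ** 2]])

# Implied marginal covariance, V = z D z^T + sigma_eps^2 I.
V = z @ D @ z.T + sd_eps ** 2 * np.eye(4)
sd = np.sqrt(np.diag(V))          # marginal standard deviations
corr = V / np.outer(sd, sd)       # marginal correlations
```

Comparing `sd` and the lower triangle of `corr` with the empirical matrices for boys and girls is a quick informal check of whether a common $D$ and $\sigma_\epsilon^2$ is plausible.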


8.5.4  Inference  for  Random  Effects 

In  some  situations,  interest  will  focus  on  inference  for  the  random  effects.  For 
example,  for  the  dental  data,  we  may  be  interested  in  the  growth  curve  of  a  particular 
child.  Estimates  of  random  effects  are  also  important  for  model  checking. 

Various approaches to inference for random effects have been proposed. The
simplest, which we describe first, is to take an empirical Bayes approach. From a
Bayesian standpoint, there is no distinction inferentially between fixed and random
effects (the distinction is in the priors that are assigned). Consequently, inference is
simply based on the posterior distribution $p(b_i \mid y)$. Consider the LMM

$$y_i = x_i\beta + z_i b_i + \epsilon_i$$

and assume $b_i$ and $\epsilon_i$ are independent with $b_i \mid D \sim_{iid} N_{q+1}(0, D)$ and $\epsilon_i \mid \sigma_\epsilon^2 \sim_{ind} N_{n_i}(0, \sigma_\epsilon^2 I)$, so that $\alpha = [\sigma_\epsilon^2, D]$. We begin by considering the simple,
albeit unrealistic, situation, in which $\beta$ and $\alpha$ are known. Letting $y_i^\star = y_i - x_i\beta$,
we have

$$p(b_i \mid y_i, \beta, \alpha) \propto p(y_i \mid b_i, \beta, \alpha) \times \pi(b_i \mid \alpha)$$


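which we recognize as a multiple linear regression with a zero-centered normal prior
on the parameters $b_i$ (this model is closely linked to that used in ridge regression,
see Sect. 10.5.1). Using a standard derivation, (5.7),

$$b_i \mid y_i, \beta, \alpha \sim N_{q+1}[E(b_i \mid y_i, \beta, \alpha), \mathrm{var}(b_i \mid y_i, \beta, \alpha)]$$

with mean and variance

$$E[b_i \mid y_i, \beta, \alpha] = \left( \frac{z_i^T z_i}{\sigma_\epsilon^2} + D^{-1} \right)^{-1} \frac{z_i^T (y_i - x_i\beta)}{\sigma_\epsilon^2} = D z_i^T V_i^{-1} (y_i - x_i\beta), \qquad (8.25)$$

$$\mathrm{var}(b_i \mid y_i, \beta, \alpha) = \left( \frac{z_i^T z_i}{\sigma_\epsilon^2} + D^{-1} \right)^{-1} = D - D z_i^T V_i^{-1} z_i D, \qquad (8.26)$$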

see  Exercise  8.4.  As  we  will  see  in  this  section,  the  estimate  (8.25)  may  be  derived 
under  a  number  of  different  formulations. 
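
As a numerical illustration of (8.25) and (8.26), the sketch below computes the conditional mean and variance of $b_i$ in the $V_i^{-1}$ form and checks them against the ridge-type form based on $z_i^T z_i/\sigma_\epsilon^2 + D^{-1}$. The responses `y_i` are hypothetical; $\beta$, $D$, and $\sigma_\epsilon$ are set to rounded values of the kind reported for the dental example and are treated here as known.

```python
import numpy as np

def random_effect_posterior(y_i, x_i, z_i, beta, D, sigma2_eps):
    """Mean and variance of b_i | y_i, beta, alpha, written in terms of V_i^{-1}."""
    V_i = z_i @ D @ z_i.T + sigma2_eps * np.eye(len(y_i))
    gain = D @ z_i.T @ np.linalg.inv(V_i)      # D z_i^T V_i^{-1}
    mean = gain @ (y_i - x_i @ beta)
    var = D - gain @ z_i @ D
    return mean, var

t = np.array([-2.0, -1.0, 1.0, 2.0])
x_i = z_i = np.column_stack([np.ones_like(t), t])
beta = np.array([25.0, 0.78])                  # treated as known
D = np.array([[1.84 ** 2, 0.21 * 1.84 * 0.18],
              [0.21 * 1.84 * 0.18, 0.18 ** 2]])
sigma2_eps = 1.31 ** 2
y_i = np.array([22.0, 24.5, 26.0, 28.5])       # hypothetical responses for one child

mean, var = random_effect_posterior(y_i, x_i, z_i, beta, D, sigma2_eps)

# Equivalent ridge-type expressions.
var_ridge = np.linalg.inv(z_i.T @ z_i / sigma2_eps + np.linalg.inv(D))
mean_ridge = var_ridge @ z_i.T @ (y_i - x_i @ beta) / sigma2_eps

# Unpenalized least squares fit of the residuals on z_i, for comparison.
b_ols = np.linalg.lstsq(z_i, y_i - x_i @ beta, rcond=None)[0]
```

The slope component of `mean` is much closer to zero than the corresponding entry of `b_ols`, reflecting the shrinkage induced by the small slope variance in $D$.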

A fully Bayesian approach would consider

$$p(b \mid y) = \int \int p(b, \beta, \alpha \mid y) \, d\beta \, d\alpha,$$

which emphasizes that the uncertainty in $\beta$, $\alpha$ is not acknowledged in the derivation
of (8.25) and (8.26).

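We now demonstrate how we may account for estimation of $\beta$ with a flat prior
on $\beta$ and assuming $\alpha$ known. The posterior mean and variance of $\beta$ are

$$E[\beta \mid y, \alpha] = \hat\beta_G, \qquad \mathrm{var}(\beta \mid y, \alpha) = (x^T V^{-1} x)^{-1},$$

where $\hat\beta_G$ is the GLS estimator (these forms are derived for more general priors later,
see (8.35) and (8.36)). Consequently,

$$E[b_i \mid y, \alpha] = E_{\beta \mid y, \alpha}[E(b_i \mid \beta, y, \alpha)] = D z_i^T V_i^{-1} (y_i - x_i \hat\beta_G) \qquad (8.27)$$

$$\begin{aligned}
\mathrm{var}(b_i \mid y, \alpha) &= E_{\beta \mid y, \alpha}[\mathrm{var}(b_i \mid \beta, y, \alpha)] + \mathrm{var}_{\beta \mid y, \alpha}(E[b_i \mid \beta, y, \alpha]) \\
&= E_{\beta \mid y, \alpha}[D - D z_i^T V_i^{-1} z_i D] + \mathrm{var}_{\beta \mid y, \alpha}(D z_i^T V_i^{-1} (y_i - x_i\beta)) \\
&= D - D z_i^T V_i^{-1} z_i D + D z_i^T V_i^{-1} x_i (x^T V^{-1} x)^{-1} x_i^T V_i^{-1} z_i D. \qquad (8.28)
\end{aligned}$$

Therefore, we can easily account for the estimation of $\beta$, but no such simple
development is available to account for estimation of $\alpha$.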

From a frequentist perspective, inference for random effects is often viewed as
prediction rather than estimation, since $[b_1, \ldots, b_m]$ are random variables and not
unknown constants. Many different criteria may be used to find a predictor $\tilde{b} = f(Y)$ of $b$, for a generic unit.

We begin by defining the optimum predictor as that which minimizes the mean
squared error (MSE). Let $b^\star$ represent a general predictor and consider the MSE:

$$\mathrm{MSE}(b^\star) = E_{Y,b}[(b^\star - b)^T A (b^\star - b)],$$

where we emphasize that the expectation is with respect to both $y$ and $b$, and $A$
is any positive definite symmetric matrix. We show that the MSE is minimized by
$\tilde{b} = E[b \mid y]$. For the moment, we suppress the dependence on any additional
parameters. We can express the MSE in terms of $b^\star$ and $\tilde{b}$:

$$\begin{aligned}
\mathrm{MSE}(b^\star) &= E_{Y,b}[(b^\star - b)^T A (b^\star - b)] \\
&= E_{Y,b}[(b^\star - \tilde{b} + \tilde{b} - b)^T A (b^\star - \tilde{b} + \tilde{b} - b)] \\
&= E_{Y,b}[(b^\star - \tilde{b})^T A (b^\star - \tilde{b})] + 2 \times E_{Y,b}[(b^\star - \tilde{b})^T A (\tilde{b} - b)] \\
&\quad + E_{Y,b}[(\tilde{b} - b)^T A (\tilde{b} - b)]. \qquad (8.29)
\end{aligned}$$

The third term does not involve $b^\star$, and we may write the second expectation as

$$\begin{aligned}
E_{Y,b}[(b^\star - \tilde{b})^T A (\tilde{b} - b)] &= E_Y\{ E_{b \mid y}[(b^\star - \tilde{b})^T A (\tilde{b} - b) \mid y] \} \\
&= E_Y[(b^\star - \tilde{b})^T A (\tilde{b} - \tilde{b})] = 0
\end{aligned}$$

and so, minimizing MSE corresponds to minimizing the first term in (8.29). This
quantity must be nonnegative, and so, the solution is to take $b^\star = \tilde{b}$. The latter is the
solution irrespective of $A$. So the best prediction is that which estimates the random
variable $b$ by its conditional mean. We now examine properties of $\tilde{b}$.

The usual frequentist optimality criteria for a fixed effect $\theta$ concentrate upon
unbiasedness and upon the variance of the estimator, $\mathrm{var}(\hat\theta)$, see Sect. 2.2. When
inference is required for a random effect $b$, these criteria need adjustment. Specifically,
an unbiased predictor $\tilde{b}$ is such that

$$E[\tilde{b} - b] = 0,$$

to give

$$E[\tilde{b}] = E[b]$$

so that the expectation of the predictor is equal to the expectation of the random
variable that it is predicting. For $\tilde{b} = E[b \mid y]$,

$$E_y[\tilde{b}] = E_y[E_{b \mid y}(b \mid y)] = E_b[b]$$

where the first step follows on substitution of $\tilde{b}$ and the second from iterated
expectation; therefore, we have an unbiased predictor. We emphasize that we do
not have an unbiased estimator in the usual sense, and in general, $\tilde{b}$ will display
shrinkage toward zero, as we illustrate in later examples.

The variance of a random variable is defined with respect to a fixed number, the
mean. In the context of prediction of a random variable, a more relevant summary
of the variability is

$$\mathrm{var}(\tilde{b} - b) = \mathrm{var}(\tilde{b}) + \mathrm{var}(b) - 2 \times \mathrm{cov}(\tilde{b}, b).$$

If this quantity is small, then the predictor and the random variable are moving in a
stochastically similar way. We have

$$\begin{aligned}
\mathrm{cov}_{y,b}(\tilde{b}, b) &= E_y[\mathrm{cov}(\tilde{b}, b \mid y)] + \mathrm{cov}_y(E[\tilde{b} \mid y], E[b \mid y]) \\
&= E_y[\mathrm{cov}(\tilde{b}, b \mid y)] + \mathrm{cov}_y(\tilde{b}, \tilde{b}) \\
&= \mathrm{var}(\tilde{b}), \qquad (8.30)
\end{aligned}$$

since the first term in (8.30) is the covariance between the constant $E[b \mid y]$ (since
$y$ is conditioned upon) and $b$, and so is zero. To obtain the form of the second term
in (8.30), we have used $E[\tilde{b} \mid y] = E[E[b \mid y] \mid y] = \tilde{b}$. Hence,

$$\mathrm{var}(\tilde{b} - b) = \mathrm{var}(b) - \mathrm{var}(\tilde{b}) = D - \mathrm{var}(\tilde{b}).$$

In order to determine the form of $\tilde{b} = E[b \mid y]$ and evaluate $\mathrm{var}(\tilde{b} - b)$, we need
to provide more information on the model that is to be used, so that the form of
$p(b \mid y)$ can be determined.

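For the LMM,

$$\begin{bmatrix} b_i \\ Y_i \end{bmatrix} \sim N_{q+1+n_i}\left( \begin{bmatrix} 0 \\ x_i\beta \end{bmatrix}, \begin{bmatrix} D & D z_i^T \\ z_i D & V_i \end{bmatrix} \right)$$

since

$$\mathrm{cov}(b_i, Y_i) = \mathrm{cov}(b_i, x_i\beta + z_i b_i + \epsilon_i) = \mathrm{cov}(b_i, z_i b_i) = D z_i^T$$

(Appendix B), and similarly, $\mathrm{cov}(Y_i, b_i) = z_i D$. The conditional distribution of a
multivariate normal distribution is normal also (Appendix D) with mean

$$\tilde{b}_i = E[b_i \mid y_i] = D z_i^T V_i^{-1} (y_i - x_i\beta) \qquad (8.31)$$

which coincides with the Bayesian derivation earlier, (8.25). From a frequentist
perspective, (8.25) is known as the best linear unbiased predictor (BLUP), where
unbiased refers to it satisfying $E[\tilde{b}_i] = E[b_i]$.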

The form (8.31) is not of practical use since it depends on the unknown $\beta$ and $\alpha$;
instead, we use

$$\hat{b}_i = E[b_i \mid y_i, \hat\beta_G, \hat\alpha] = \hat{D} z_i^T \hat{V}_i^{-1} (y_i - x_i \hat\beta_G) \qquad (8.32)$$

where $\hat{D} = D(\hat\alpha)$ and $\hat{V}_i = V_i(\hat\alpha)$. The implications of the substitution of $\hat\beta_G$ are
not great, since it is an unbiased estimator and appears in (8.31) in a linear fashion,
but the use of $\hat\alpha$ is more problematic. In particular the predictor $\hat{b}_i$ is no longer linear
in the data, so that exact properties can no longer be derived.

The uncertainty in the prediction, accounting for the estimation of $\beta$, is

$$\begin{aligned}
\mathrm{var}(\hat{b}_i - b_i) &= D - \mathrm{var}(\hat{b}_i) \\
&= D - D z_i^T V_i^{-1} z_i D + D z_i^T V_i^{-1} x_i (x^T V^{-1} x)^{-1} x_i^T V_i^{-1} z_i D
\end{aligned}$$

after tedious algebra (Exercise 8.5), so that (8.28) is recovered. We again emphasize
that this estimate of variability of prediction does not acknowledge the uncertainty
in $\hat\alpha$. Given correct specification of the marginal variance model, $\mathrm{var}(Y \mid \alpha) = V(\alpha)$, and a consistent estimator of $\alpha$, $\hat{b}_i$ is asymptotically normal with a known
distribution, which can be used to form interval estimates. As an alternative to the
use of (8.25), we can implement a fully Bayesian approach (Sect. 8.6), though no
closed-form solution emerges.

As  a  final  derivation,  rather  than  assume  normality,  we  could  consider  estimators 
that  are  linear  in  y.  Exercise  8.6  shows  that  this  again  leads  to 

$$\tilde{b}_i = D z_i^{\mathsf T} V_i^{-1}(y_i - x_i\beta).$$

The best linear predictor is therefore identical to the best predictor under normality. For general distributions, $E[b_i \mid y_i]$ will not necessarily be linear in $y_i$.

Since we now have a method for predicting $b_i$, we can examine fitted values:

$$\begin{aligned}
\hat{Y}_i &= x_i\hat{\beta} + z_i\hat{b}_i \\
&= x_i\hat{\beta} + z_i D z_i^{\mathsf T} V_i^{-1}(y_i - x_i\hat{\beta}) \\
&= (I_{n_i} - W_i)\, x_i\hat{\beta} + W_i y_i,
\end{aligned}$$

with $W_i = z_i D z_i^{\mathsf T} V_i^{-1}$, so that we have a weighted combination of the population profile and the unit's data. If $D = 0$, we obtain $\hat{Y}_i = x_i\hat{\beta}$, and if $D$ is "small," the fitted values are close to the population curve, which is reasonable if there is little between-unit variability. If elements of $D$ are large, the fitted values are closer to the observed data.


Example: One-Way ANOVA

For the simple balanced ANOVA model previously considered, the calculation of $E[b_i \mid y_i, \hat{\beta}, \hat{\alpha}]$ results in

$$\hat{b}_i = \frac{n\hat{\sigma}_0^2}{\hat{\sigma}_\epsilon^2 + n\hat{\sigma}_0^2}\,(\bar{y}_i - \hat{\beta}_0),$$

to give a predictor that is a weighted combination of the "residual" $\bar{y}_i - \hat{\beta}_0$ and zero. For finite $n$, the predictor is biased towards zero. As $n \to \infty$, $\hat{b}_i \to \bar{y}_i - \hat{\beta}_0$, so that $\hat{\beta}_0 + \hat{b}_i \to \bar{y}_i$, illustrating that the shrinkage disappears as the number of observations on a unit $n$ increases, as we would hope.
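
The behavior of the shrinkage weight as $n$ grows can be checked numerically. The following is a minimal sketch only: the function names and the variance components used in the loop are illustrative assumptions, not values from the dental data or any other example in the text.

```python
def shrinkage_weight(n, sigma0_sq, sigma_eps_sq):
    """Weight n*sigma0^2 / (sigma_eps^2 + n*sigma0^2) applied to the residual."""
    return n * sigma0_sq / (sigma_eps_sq + n * sigma0_sq)


def predict_random_effect(unit_mean, beta0_hat, n, sigma0_sq, sigma_eps_sq):
    """Plug-in predictor of b_i for the balanced one-way ANOVA model."""
    weight = shrinkage_weight(n, sigma0_sq, sigma_eps_sq)
    return weight * (unit_mean - beta0_hat)


if __name__ == "__main__":
    # Hypothetical values: equal between-unit and within-unit variances,
    # a unit mean of 12 and an estimated population mean of 10.
    for n in (1, 4, 16, 256):
        w = shrinkage_weight(n, sigma0_sq=1.0, sigma_eps_sq=1.0)
        b_hat = predict_random_effect(12.0, 10.0, n, 1.0, 1.0)
        print(n, round(w, 3), round(b_hat, 3))
```

With these assumed values the weight rises from 0.5 at $n = 1$ towards 1, so the predictor moves from halfway between zero and the residual towards the residual itself.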

8.6  Bayesian  Inference  for  Linear  Mixed  Models 
8.6.1  A  Three-Stage  Hierarchical  Model 

We  consider  the  LMM 


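$$y_i = x_i\beta + z_i b_i + \epsilon_i,$$

with $b_i$ and $\epsilon_i$ independent and distributed as $b_i \mid D \sim_{iid} \mathrm{N}_{q+1}(0, D)$ and $\epsilon_i \mid \sigma_\epsilon^2 \sim_{ind} \mathrm{N}_{n_i}(0, \sigma_\epsilon^2 I)$, $i = 1, \ldots, m$.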

The  second  stage  assumption  for  bi  can  be  motivated  using  the  concept  of 
exchangeability  that  we  encountered  in  Sect.  3.9.  If  we  believe  a  priori  that 
$b_1, \ldots, b_m$ are exchangeable (and are considered within a hypothetical infinite sequence of such random variables), then it can be shown using representation theorems (Sect. 3.9) that the prior has the form

$$p(b_1, \ldots, b_m) = \int \prod_{i=1}^{m} p(b_i \mid \phi)\, \pi(\phi)\, d\phi,$$

so that the collection $[b_1, \ldots, b_m]$ are conditionally independent, given hyperparameters $\phi$, with the hyperparameters having a distribution known as a hyperprior. Hence, we have a two-stage (hierarchical) prior:

$$b_i \mid \phi \sim_{iid} p(\cdot \mid \phi), \quad i = 1, \ldots, m,$$

$$\phi \sim \pi(\cdot).$$

Parametric choices for $p(\cdot \mid \phi)$ and $\pi(\cdot)$ are based on the application, though computational convenience may also be a consideration (as we discuss in Sect. 8.6.3). We initially consider the multivariate normal prior $\mathrm{N}_{q+1}(0, D)$ so that $\phi = D$. The
practical  importance  of  this  representation  is  that  under  exchangeability  the  beliefs 
about  each  of  the  unit-specific  parameters  must  be  identical.  For  example,  for  the 
dental  data,  if  we  do  not  believe  that  the  individual-specific  deviations  from  the 
average  intercepts  and  slopes  for  boys  and  girls  are  exchangeable,  then  we  should 
consider  separate  prior  specifications  for  each  gender.  In  general,  if  collections  of 
units  cluster  due  to  an  observed  covariate  that  we  believe  will  influence  bi,  then 
our  prior  should  reflect  this.  This  framework  contrasts  with  the  sampling  theory 
approach  in  which  the  random  effects  are  assumed  to  be  a  random  sample  from  a 
hypothetical infinite population.

The three-stage model is

Stage One: Likelihood:

$$p(y_i \mid \beta, b_i, \sigma_\epsilon^2), \quad i = 1, \ldots, m.$$

Stage Two: Random effects prior:

$$p(b_i \mid D), \quad i = 1, \ldots, m.$$

Stage Three: Hyperprior:

$$\pi(\beta, D, \sigma_\epsilon^2).$$


8.6.2  Hyperpriors 

It  is  common  to  assume  independent  priors: 

$$\pi(\beta, D, \sigma_\epsilon^2) = \pi(\beta)\,\pi(D)\,\pi(\sigma_\epsilon^2).$$

A multivariate normal distribution for $\beta$ and an inverse gamma distribution for $\sigma_\epsilon^2$ are often reasonable choices, since they are flexible enough to reflect a range of prior information. The data are typically informative on $\beta$ and $\sigma_\epsilon^2$ also. These choices also lead to conditional distributions that have convenient forms for Gibbs sampling (Sect. 3.8.4). The prior specification for $D$ is less straightforward.

If $D$ is a diagonal matrix with elements $\sigma_k^2$, $k = 0, 1, \ldots, q$, then an obvious choice is

$$\pi(\sigma_0^2, \ldots, \sigma_q^2) = \prod_{k=0}^{q} \mathrm{IGa}(a_k, b_k),$$

where $\mathrm{IGa}(a_k, b_k)$ denotes the inverse gamma distribution with prespecified parameters $a_k, b_k$, $k = 0, \ldots, q$. These choices also lead to conjugate conditional distributions for Gibbs sampling. Other choices are certainly possible, however, for example, those contained in Gelman (2006). A prior for non-diagonal $D$ is more troublesome; there are $(q + 2)(q + 1)/2$ elements, with the restriction that the matrix of elements is positive definite. The inverse Wishart distribution is the conjugate choice and is the only distribution for which any great practical experience has been gathered.

We digress to describe how the Wishart distribution can be motivated. Suppose $Z_1, \ldots, Z_r \sim_{iid} \mathrm{N}_p(0, S)$, with $S$ a non-singular variance-covariance matrix, and let

$$W = \sum_{j=1}^{r} Z_j Z_j^{\mathsf T}. \qquad (8.33)$$

Then $W$ follows a Wishart distribution, denoted $\mathrm{Wish}_p(r, S)$, with probability density function

$$p(w) = c^{-1}\, |w|^{(r-p-1)/2} \exp\left\{ -\frac{1}{2}\,\mathrm{tr}(w S^{-1}) \right\},$$

where

$$c = 2^{rp/2}\, \Gamma_p(r/2)\, |S|^{r/2}, \qquad (8.34)$$

with

$$\Gamma_p(r/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma[(r + 1 - j)/2]$$

the generalized gamma function. We require $r > p - 1$ for a proper density. The mean is

$$E[W] = rS.$$

Taking $p = 1$ yields

$$p(w) = \frac{(2S)^{-r/2}}{\Gamma(r/2)}\, w^{r/2 - 1} \exp(-w/2S),$$

for $w > 0$, revealing that the Wishart distribution is a multivariate version of the gamma distribution, parameterized as $\mathrm{Ga}[r/2, 1/(2S)]$. Further, taking $S = 1$ gives a $\chi_r^2$ random variable, which is clear from (8.33).

If $W \sim \mathrm{Wish}_p(r, S)$, the distribution of $D = W^{-1}$ is known as the inverse Wishart distribution, denoted $\mathrm{InvWish}_p(r, S)$, with density

$$p(d) = c^{-1}\, |d|^{-(r+p+1)/2} \exp\left\{ -\frac{1}{2}\,\mathrm{tr}(d^{-1} S^{-1}) \right\},$$

where $c$ is again given by (8.34). We denote this random variable by $D$ in anticipation of subsequently specifying an inverse Wishart distribution as prior for the variance-covariance matrix of the random effects $D$. The mean is

$$E[D] = \frac{S^{-1}}{r - p - 1}$$

and is defined for $r > p + 1$. If $p = 1$, we recover the inverse gamma distribution $\mathrm{IGa}(r/2, 1/2S)$ with

$$E[D] = \frac{1}{S(r - 2)}, \qquad \mathrm{var}(D) = \frac{2}{S^2 (r - 2)^2 (r - 4)},$$

so that small $r$ gives a more dispersed distribution (which is true for general $p$). One way of thinking about prior specification is to imagine that the prior data for the precision consists of observing $r$ multivariate normal random variables with empirical variance-covariance matrix $R = S^{-1}$. See Appendix D for further properties of the Wishart and inverse Wishart distributions.
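
The construction (8.33) translates directly into a simulation recipe. The sketch below is an illustration under stated assumptions: it is restricted to $p = 2$ with a diagonal $S$, the function names are hypothetical, and the default values $r = 4$ and $S = \mathrm{diag}(1, 10)$ are chosen only for illustration. Each Wishart draw is inverted to give a draw of $D$, from which the implied correlation is extracted.

```python
import math
import random


def wishart_draw_2x2(r, s11, s22, rng):
    """Draw W = sum_j Z_j Z_j^T with Z_j ~ N_2(0, diag(s11, s22)), as in (8.33)."""
    w11 = w12 = w22 = 0.0
    for _ in range(r):
        z1 = rng.gauss(0.0, math.sqrt(s11))
        z2 = rng.gauss(0.0, math.sqrt(s22))
        w11 += z1 * z1
        w12 += z1 * z2
        w22 += z2 * z2
    return w11, w12, w22


def invert_sym_2x2(a11, a12, a22):
    """Inverse of the symmetric 2 x 2 matrix [[a11, a12], [a12, a22]]."""
    det = a11 * a22 - a12 * a12
    return a22 / det, -a12 / det, a11 / det


def simulate(r=4, s11=1.0, s22=10.0, n_draws=20000, seed=1):
    """Return the Monte Carlo mean of W_11 and the correlations implied by D = W^{-1}."""
    rng = random.Random(seed)
    w11_mean = 0.0
    rhos = []
    for _ in range(n_draws):
        w11, w12, w22 = wishart_draw_2x2(r, s11, s22, rng)
        d11, d12, d22 = invert_sym_2x2(w11, w12, w22)
        w11_mean += w11 / n_draws
        rhos.append(d12 / math.sqrt(d11 * d22))
    return w11_mean, rhos


if __name__ == "__main__":
    w11_mean, rhos = simulate()
    print(w11_mean, sum(rhos) / len(rhos))
```

Under these assumptions the Monte Carlo mean of $W_{11}$ should be close to $r S_{11} = 4$, in line with $E[W] = rS$, and the correlations should be centered near zero because $S$ is diagonal.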

Returning from our digression, within the LMM, we specify $W = D^{-1} \sim \mathrm{W}_{q+1}(r, R^{-1})$ where we have taken $S = R^{-1}$ to aid in prior specification. We require choices for $r$ and $R$. Since

$$E[D] = \frac{R}{r - q - 2},$$

$R$ may be scaled to be a prior estimate of $D$, with $r$ acting as a strength of belief in the prior, with large $r$ placing more mass close to the mean.

One method of specification that attempts to minimize the influence of the prior is to take $r = q + 3$, the smallest integer for which the prior mean exists, to give $E[D] = R$ as the prior guess for $D$. We now describe another way of specifying a Wishart prior, based on Wakefield (2009b). Marginalization over $D$ gives $b_i$ as multivariate Student's $t$ with location $0$, scale matrix $R/(r - q)$, and degrees of freedom $d = r - q$. The margins of a multivariate Student's $t$ are $t$ also, which allows $r$ and $R$ to be chosen via specification of an interval for the $j$th element of $b_i$, $b_{ij}$. Specifically, $b_{ij}$ follows a univariate Student's $t$ distribution with location $0$, scale $R_{jj}/(r - q)$, and degrees of freedom $d = r - q$. For a required range of $[-V, V]$ with probability 0.95, we use the relationship $\pm t^d_{0.025}\sqrt{R_{jj}/d} = \pm V$, where $t^d_p$ is the $100 \times p$th quantile of a Student's $t$ random variable with $d$ degrees of freedom. Picking the smallest integer that results in a proper prior gives $r = q + 1$ so that $d = 1$ and $R_{jj} = V^2 d / (t^d_{1-(1-p)/2})^2$, where $p$ here denotes the required coverage probability (0.95 above).

As an example of this procedure, consider a single random effect ($q = 0$). We specify a $\mathrm{Ga}[r/2, 1/(2S)]$ prior for $\sigma_0^{-2}$, so that marginally, $b_i$ is a Student's $t$ distribution with location $0$, scale $1/(rS)$, and degrees of freedom $r$. The above prescription gives $r = 1$ and $S = (t^d_{1-(1-p)/2})^2 / V^2$. In the more conventional $\mathrm{Ga}(a, b)$ parameterization, we obtain $a = 0.5$ and $b = V^2 / [2 (t^d_{1-(1-p)/2})^2]$. For example, for the dental data, if we believe that a 95% range for the intercepts, about the population intercept, is $\pm V = \pm 0.2$, we obtain the choice $\mathrm{Ga}(0.5, 0.000124)$ for $\sigma_0^{-2}$. This translates into a prior for $\sigma_0$ (which is more interpretable) with 5%, 50%, and 95% points of 0.008, 0.023, and 0.25. An important point to emphasize is that within the LMM, a proper prior is required for $D$ to ensure propriety of the posterior distribution.
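
The arithmetic behind the $\mathrm{Ga}(0.5, 0.000124)$ choice can be reproduced with the standard library alone. This is a sketch under two stated facts rather than part of the original derivation: a Student's $t$ with one degree of freedom is a Cauchy distribution, whose quantile function is $\tan\{\pi(u - 1/2)\}$, and a gamma variable with shape $1/2$ is a scaled $\chi_1^2$, the square of a standard normal. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def gamma_prior_for_precision(v, coverage=0.95):
    """Return (a, b) of the Ga(a, b) prior on the precision for r = 1 (so d = 1)."""
    # Upper quantile level 1 - (1 - p)/2 of a t distribution on 1 df (a Cauchy).
    upper = 1.0 - (1.0 - coverage) / 2.0
    t_quantile = math.tan(math.pi * (upper - 0.5))
    return 0.5, v ** 2 / (2.0 * t_quantile ** 2)


def sigma_quantile(u, b):
    """u-th quantile of sigma = precision^(-1/2) when precision ~ Ga(1/2, rate b)."""
    # 2 * b * precision is chi-squared on 1 df, so
    # P(sigma <= s) = P(|Z| >= sqrt(2b)/s) for a standard normal Z.
    z = NormalDist().inv_cdf(1.0 - u / 2.0)
    return math.sqrt(2.0 * b) / z


if __name__ == "__main__":
    a, b = gamma_prior_for_precision(0.2)
    print(a, b)
    print([sigma_quantile(u, b) for u in (0.05, 0.5, 0.95)])
```

With $V = 0.2$ the sketch gives $b$ of about 0.000124 and quantiles for $\sigma_0$ that agree with the values quoted above after rounding.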

A  weakness  with  the  Wishart  distribution  is  that  it  is  deficient  in  second  moment 
parameters,  since  there  is  only  a  single  degrees  of  freedom  parameter  r.  So,  for 
example,  it  is  not  possible  to  have  differing  levels  of  certainty  in  the  tightness  of 
the  prior  distribution  for  different  elements  of  D.  This  contrasts  with  the  situation 
in  which  D  is  diagonal,  and  we  specify  independent  inverse  gamma  priors,  which 
gives separate precision parameters for each variance.

Fig. 8.2 Prior summaries for the prior $D^{-1} \sim \mathrm{W}_2(r, R^{-1})$ with $r = 4$ and $R$ containing elements [1.0, 0, 0, 0.1]. Univariate marginal densities for (a) $\sigma_0$, (b) $\rho$, (c) $\sigma_1$, and the bivariate density for (d) $(\sigma_0, \sigma_1)$

Figure 8.2 displays summaries for an example with a $2 \times 2$ variance-covariance matrix (so that $q = 1$). We assume $D^{-1} \sim \mathrm{W}_2(r, R^{-1})$ with $r = 4$ and

$$E[D] = \frac{R}{4 - 1 - 2} = R \quad \text{with} \quad R = \begin{bmatrix} 1.0 & 0 \\ 0 & 0.1 \end{bmatrix}.$$

We summarize samples from the Wishart via marginal distributions for $\sigma_0$, $\sigma_1$, and $\rho$ since these are more interpretable. These plots were obtained by simulating samples for $D^{-1}$ from the Wishart prior and then converting these samples to the required functions of interest. Finally, we smooth the sample histograms and scatter plots to produce Fig. 8.2. As we would expect, the prior on the correlation is symmetric about 0. Examination of intervals for $\sigma_0$, $\sigma_1$ can inform on whether we believe the prior is suitable for any given application. Going one step further, we could then simulate random effects from the zero mean normal with variance $D$, the latter being a draw from the prior; we might also continue to simulate data, though this would require draws from the other priors too.

8.6.3 Implementation

For simplicity, we suppose that $x_i = z_i$. It is convenient in what follows to reparameterize in terms of the set $[\beta_1, \ldots, \beta_m, \tau, \beta, W]$ where $\beta_i = \beta + b_i$, $\tau = \sigma_\epsilon^{-2}$, and $W = D^{-1}$. The joint posterior is

$$p(\beta_1, \ldots, \beta_m, \tau, \beta, W \mid y) \propto \prod_{i=1}^{m} \left[ p(y_i \mid \beta_i, \tau)\, p(\beta_i \mid \beta, W) \right] \pi(\beta)\, \pi(\tau)\, \pi(W),$$

with priors:

$$\beta \sim \mathrm{N}_{q+1}(\beta_0, V_0), \quad \tau \sim \mathrm{Ga}(a_0, b_0), \quad W \sim \mathrm{W}_{q+1}(r, R^{-1}).$$

Marginal  distributions,  and  summaries  of  these  distributions,  are  not  available  in 
closed  form.  Various  approaches  to  obtaining  quantities  of  interest  are  available. 
The  INLA  procedure  described  in  Sect.  3.7.4  is  ideally  suited  to  the  LMM.  As  an 
alternative,  we  describe  an  MCMC  strategy  using  Gibbs  sampling  (Sect.  3.8.4).  The 
required  conditional  distributions  are 

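• $p(\beta \mid \tau, W, \beta_1, \ldots, \beta_m, y)$.

• $p(\tau \mid \beta, W, \beta_1, \ldots, \beta_m, y)$.

• $p(\beta_i \mid \beta, \tau, W, y)$, $i = 1, \ldots, m$.

• $p(W \mid \beta, \tau, \beta_1, \ldots, \beta_m, y)$.

where we block update $\beta$, $W$, and $\beta_i$ to reduce dependence in the Markov chain.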

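The conditional distributions for $\beta$, $\tau$, and $\beta_i$ are straightforward to derive (Exercise 8.10) and are given, respectively, by

$$\begin{aligned}
\beta \mid \beta_1, \ldots, \beta_m, W &\propto \prod_{i=1}^{m} p(\beta_i \mid \beta, W)\, \pi(\beta) \\
&\sim \mathrm{N}_{q+1}\left[ (mW + V_0^{-1})^{-1} \left( W \sum_{i=1}^{m} \beta_i + V_0^{-1}\beta_0 \right),\ (mW + V_0^{-1})^{-1} \right]
\end{aligned}$$

$$\begin{aligned}
\tau \mid \beta_1, \ldots, \beta_m, y &\propto \prod_{i=1}^{m} p(y_i \mid \beta_i, \tau)\, \pi(\tau) \\
&\sim \mathrm{Ga}\left( a_0 + \frac{\sum_{i=1}^{m} n_i}{2},\ b_0 + \frac{1}{2} \sum_{i=1}^{m} (y_i - x_i\beta_i)^{\mathsf T} (y_i - x_i\beta_i) \right)
\end{aligned}$$

$$\begin{aligned}
\beta_i \mid \beta, \tau, W, y &\propto p(y_i \mid \beta_i, \tau)\, p(\beta_i \mid \beta, W) \\
&\sim \mathrm{N}_{q+1}\left[ (\tau x_i^{\mathsf T} x_i + W)^{-1} (\tau x_i^{\mathsf T} y_i + W\beta),\ (\tau x_i^{\mathsf T} x_i + W)^{-1} \right].
\end{aligned}$$

Conditional independencies have been exploited, and in each case, the notation explicitly conditions on only those parameters on which the conditional distribution depends. For example, to derive the conditional distribution for $\beta$, we only require $[\beta_1, \ldots, \beta_m]$ and $W$. The conditional for $\beta_i$ is, once we reparameterize, identical to the empirical Bayes estimates derived for the random effects in Sect. 8.5.4 (Exercise 8.11). This comparison illustrates how the uncertainty in $\beta$ and $\alpha = [\tau, W]$ is accounted for across iterates of the Gibbs sampler.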

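Deriving the conditional distribution for $W$ is a little more involved. First, note that

$$(\beta_i - \beta)^{\mathsf T} W (\beta_i - \beta) = \mathrm{tr}[(\beta_i - \beta)^{\mathsf T} W (\beta_i - \beta)] = \mathrm{tr}[W (\beta_i - \beta)(\beta_i - \beta)^{\mathsf T}].$$

Then

$$\begin{aligned}
W \mid \beta_1, \ldots, \beta_m, \beta &\propto \prod_{i=1}^{m} p(\beta_i \mid \beta, W) \times \pi(W) \\
&\propto |W|^{(m + r - q - 1 - 1)/2} \exp\left\{ -\frac{1}{2} \left[ \sum_{i=1}^{m} (\beta_i - \beta)^{\mathsf T} W (\beta_i - \beta) + \mathrm{tr}(WR) \right] \right\} \\
&= |W|^{(m + r - q - 1 - 1)/2} \exp\left\{ -\frac{1}{2}\, \mathrm{tr}\left( W \left[ \sum_{i=1}^{m} (\beta_i - \beta)(\beta_i - \beta)^{\mathsf T} + R \right] \right) \right\}
\end{aligned}$$

to give the conditional distribution

$$W \mid \beta_1, \ldots, \beta_m, \beta \sim \mathrm{W}_{q+1}\left( r + m,\ \left[ R + \sum_{i=1}^{m} (\beta_i - \beta)(\beta_i - \beta)^{\mathsf T} \right]^{-1} \right).$$

This illustrates how $r$ and $R$ are comparable to $m$ and the between-unit sum of squares, respectively, which aids in prior specification. Since

$$E[D \mid \beta_1, \ldots, \beta_m, \beta] = \frac{R + \sum_{i=1}^{m} (\beta_i - \beta)(\beta_i - \beta)^{\mathsf T}}{r + m - q - 2},$$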


the  form  of  the  conditional  distribution  suggests  that  it  is  better  to  err  on  the  side  of 
picking  R  too  small,  since  a  large  R  will  always  dominate  the  sum  of  squares.  If  m 
is  small,  the  prior  is  always  influential. 
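
To make the four conditional distributions concrete, the following is a minimal sketch of the Gibbs sampler for the simplest special case, a random-intercept model with $q = 0$ and $x_i = z_i$ a column of ones, so that every quantity is a scalar and the Wishart conditional reduces to a gamma distribution. The function name, starting values, and default hyperparameters are illustrative assumptions, not recommendations from the text.

```python
import random


def gibbs_random_intercept(y, n_iter=2000, burn_in=500, beta0=0.0, v0=1e6,
                           a0=0.01, b0=0.01, r=1.0, big_r=1.0, seed=1):
    """Gibbs sampler for y_ij = beta_i + eps_ij with beta_i ~ N(beta, 1/w).

    y is a list of per-unit lists. Returns post-burn-in draws of (beta, tau, w).
    """
    rng = random.Random(seed)
    m = len(y)
    n_total = sum(len(yi) for yi in y)
    beta_i = [sum(yi) / len(yi) for yi in y]      # start at the unit means
    beta, tau, w = sum(beta_i) / m, 1.0, 1.0
    draws = []
    for it in range(n_iter):
        # beta | beta_1..beta_m, W: normal with precision m*W + 1/V0.
        prec = m * w + 1.0 / v0
        mean = (w * sum(beta_i) + beta0 / v0) / prec
        beta = rng.gauss(mean, prec ** -0.5)
        # tau | beta_1..beta_m, y: gamma. gammavariate takes a scale,
        # so the rate b0 + SSE/2 is inverted.
        sse = sum((yij - bi) ** 2 for yi, bi in zip(y, beta_i) for yij in yi)
        tau = rng.gammavariate(a0 + n_total / 2.0, 1.0 / (b0 + sse / 2.0))
        # beta_i | beta, tau, W, y: normal with precision tau*n_i + W.
        for i, yi in enumerate(y):
            prec_i = tau * len(yi) + w
            mean_i = (tau * sum(yi) + w * beta) / prec_i
            beta_i[i] = rng.gauss(mean_i, prec_i ** -0.5)
        # W | beta_1..beta_m, beta: Wishart with p = 1, that is,
        # a gamma with shape (r + m)/2 and rate (R + SS)/2.
        ss = sum((bi - beta) ** 2 for bi in beta_i)
        w = rng.gammavariate((r + m) / 2.0, 2.0 / (big_r + ss))
        if it >= burn_in:
            draws.append((beta, tau, w))
    return draws


if __name__ == "__main__":
    rng = random.Random(0)
    effects = [rng.gauss(0.0, 1.0) for _ in range(30)]
    data = [[10.0 + b + rng.gauss(0.0, 0.5) for _ in range(5)] for b in effects]
    out = gibbs_random_intercept(data)
    print(sum(d[0] for d in out) / len(out))
```

The update for $W$ shows the roles of $r$ and $R$ described above: $r$ is added to $m$ in the shape, and $R$ is added to the between-unit sum of squares in the rate.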

If we collapse over $\beta_i$, $i = 1, \ldots, m$, we obtain the two-stage model with

Stage One: Marginal likelihood:

$$y \mid \beta, \tau, W \sim \mathrm{N}_N(x\beta, V),$$

where $V = V(W, \tau)$.

Stage Two: Priors: $\pi(\beta)\,\pi(\tau)\,\pi(W)$.

An MCMC algorithm iterates between

• $p(\beta \mid y, W, \tau)$

• $p(\tau \mid y, \beta, W)$

• $p(W \mid y, \beta, \tau)$

This approach is appealing since it is over a reduced parameter space, but the form of $p(W \mid y, \beta, \tau)$ is extremely awkward. The conditional for $\beta$ offers some intuition on the Bayesian approach, however. Specifically, writing $\alpha = [\tau, W]$, we obtain the conditional distribution:

$$\beta \mid y, \alpha \sim \mathrm{N}_{q+1}\left[ E(\beta \mid y, \alpha),\ \mathrm{var}(\beta \mid y, \alpha) \right]$$

where the mean and variance can be written in the weighted forms

$$E[\beta \mid y, \alpha] = w \times \hat{\beta}_G + (I - w) \times \beta_0 \qquad (8.35)$$

$$\mathrm{var}(\beta \mid y, \alpha) = w \times \mathrm{var}(\hat{\beta}_G). \qquad (8.36)$$

Here, $\hat{\beta}_G = (x^{\mathsf T} V^{-1} x)^{-1} x^{\mathsf T} V^{-1} y$ is the GLS estimator with variance $\mathrm{var}(\hat{\beta}_G) = (x^{\mathsf T} V^{-1} x)^{-1}$, and the $(q + 1) \times (q + 1)$ weight matrix is

$$w = (x^{\mathsf T} V^{-1} x + V_0^{-1})^{-1}\, x^{\mathsf T} V^{-1} x.$$

As the prior becomes more diffuse, that is, as $V_0^{-1} \to 0$, the weight $w \to I$, the conditional posterior mean approaches the GLS estimator $\hat{\beta}_G$, and the conditional posterior variance approaches $\mathrm{var}(\hat{\beta}_G)$. In contrast, as $V_0 \to 0$, so that the prior becomes more concentrated about $\beta_0$, $w \to 0$ and the conditional posterior moments approach those of the prior distribution. Since

$$E[\beta \mid y] = E_{\alpha \mid y}\left[ E(\beta \mid y, \alpha) \right],$$

the posterior mean is the conditional posterior mean averaged over $\alpha \mid y$. As is typical, the Bayesian estimate integrates over $\alpha$, while the GLS estimator conditions on $\hat{\alpha}$ for evaluation of $V$. We would expect likelihood and Bayesian point and interval estimates to be similar for large samples because the posterior $\alpha \mid y$ will become increasingly concentrated about $\hat{\alpha}$.
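
The limiting behavior of the weight in (8.35) and (8.36) is easy to verify in the scalar case, where $x^{\mathsf T} V^{-1} x$ is simply the reciprocal of $\mathrm{var}(\hat{\beta}_G)$. The sketch below assumes a single regression coefficient; the function name and the numerical inputs are hypothetical.

```python
def conditional_posterior(beta_gls, var_gls, beta0, v0):
    """Scalar version of (8.35)-(8.36): returns (weight, mean, variance)."""
    info = 1.0 / var_gls                  # x^T V^{-1} x in the scalar case
    w = info / (info + 1.0 / v0)          # weight on the GLS estimator
    mean = w * beta_gls + (1.0 - w) * beta0
    return w, mean, w * var_gls


if __name__ == "__main__":
    # Illustrative inputs: GLS estimate 2.0 with variance 0.5, prior mean 0.
    for v0 in (1e-6, 0.5, 1e6):
        print(v0, conditional_posterior(2.0, 0.5, 0.0, v0))
```

When the prior variance equals the GLS variance the weight is one half; it tends to 1 for a diffuse prior and to 0 for a concentrated one. The returned variance equals $1/(x^{\mathsf T} V^{-1} x + V_0^{-1})$, the usual precision-additive form.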


8.6.4 Extensions

Computationally, under a Bayesian approach via MCMC, it is relatively straightforward to extend the basic LMM. The conditional distributions may not be of conjugate form, but Metropolis-Hastings steps can be substituted (Sect. 3.8.2). For example, great flexibility in the distributional assumptions and error models is available, though prior specification will usually require greater care. To automatically protect against outlying measurements/individuals, Student's $t$ errors may be specified at stage one/stage two of the hierarchy, though when regression is the focus of the analysis, the greatest effort should be concentrated upon specifying appropriate mean-variance relationships at the two stages.

With the advent of MCMC, there is a temptation to fit complex models that attempt to reflect every possible nuance of the data. However, the statistical properties of complex models (such as consistency of estimation under incorrect model specification) are difficult to determine, as are the implied marginal distributions for the data (which can aid in model assessment). Overfitting is also always a hazard. Consequently, caution should be exercised in model refinement. One of the arts of statistical analysis is deciding on when model refinement is warranted.


Example:  Dental  Growth  Curves 

We analyze the data from the $m = 11$ girls only and adopt the following three-stage hierarchical model:

Stage One: As likelihood, we assume

$$y_{ij} = \beta_{i0} + \beta_{i1} t_j + \epsilon_{ij},$$

with $\epsilon_{ij} \mid \tau \sim_{iid} \mathrm{N}(0, \tau^{-1})$, $j = 1, \ldots, 4$, $i = 1, \ldots, 11$.

Stage Two: Let

$$\beta_i = \begin{bmatrix} \beta_{i0} \\ \beta_{i1} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad D = \begin{bmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{bmatrix},$$

with random effects prior

$$\beta_i \mid \beta, D \sim_{iid} \mathrm{N}_2(\beta, D), \quad i = 1, \ldots, m.$$

Stage Three: As hyperprior, we assume

$$\pi(\tau, \beta, D^{-1}) = \pi(\tau) \times \pi(\beta) \times \pi(D^{-1})$$

with improper priors on $\tau$ and $\beta$:

$$\pi(\tau) \propto \tau^{-1}, \qquad \pi(\beta) \propto 1$$

and

$$D^{-1} \sim \mathrm{W}_2(r, R^{-1}).$$

In the LMM, there is typically abundant information in the data with respect to $\tau$ and $\beta$. By placing a flat prior on $\beta$ (which are often the parameters of interest), we are also basing inference on the data alone (in nonlinear models, more care is required since a proper prior is often required to ensure propriety of the posterior).

With just 11 girls, we would expect inference for $D$ to be sensitive to the prior, and so, we consider three choices of $r$ and $R$. Each prior has the same mean of

$$E[D] = \begin{bmatrix} 1.0 & 0 \\ 0 & 0.1 \end{bmatrix} \qquad (8.37)$$

with $q = 1$ here. The above specification corresponds to an a priori belief that the spread of the expected response at 11 years across girls is

$$\pm 1.96\, E[\sigma_0] \approx \pm 1.96 \sqrt{R_{11}} = \pm 1.96$$

and the variability in slopes across girls is expected to be

$$\pm 1.96\, E[\sigma_1] \approx \pm 1.96 \sqrt{R_{22}} = \pm 0.62.$$

The exact intervals can be evaluated in an obvious fashion using simulation. The off-diagonal of $R$ is 0 as we assume there is no reason to believe the correlation between intercepts and slopes will be positive or negative.

The degrees of freedom $r$ is on the same scale as $m$ and may be viewed as a prior sample size. We pick $r = 4, 7, 28$, and to obtain the same prior mean, (8.37), $R$ is specified as

$$\begin{bmatrix} 1.0 & 0 \\ 0 & 0.1 \end{bmatrix}, \quad \begin{bmatrix} 4.0 & 0 \\ 0 & 0.4 \end{bmatrix}, \quad \begin{bmatrix} 25 & 0 \\ 0 & 2.5 \end{bmatrix}$$

for each of $r = 4, 7, 28$, respectively. To obtain a proper posterior, we require $r > 1$. We pick $r = 4$ as our smallest choice since the mean exists for this value. Samples from this prior are displayed in Fig. 8.2.
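
The three scale matrices follow from rearranging $E[D] = R/(r - q - 2)$ to $R = (r - q - 2)\,E[D]$. A minimal sketch of that arithmetic, with a hypothetical function name and matrices held as nested lists:

```python
def scale_matrix_for_prior_mean(prior_mean, r, q):
    """Return R = (r - q - 2) * E[D], so that D^{-1} ~ W_{q+1}(r, R^{-1}) has mean E[D]."""
    if r <= q + 2:
        raise ValueError("the prior mean of D exists only for r > q + 2")
    factor = r - q - 2
    return [[factor * entry for entry in row] for row in prior_mean]


if __name__ == "__main__":
    prior_mean = [[1.0, 0.0], [0.0, 0.1]]   # the prior mean in (8.37)
    for r in (4, 7, 28):
        print(r, scale_matrix_for_prior_mean(prior_mean, r, q=1))
```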

We present the results in terms of elements of $D$, for direct comparison with the prior. If we were reporting substantive conclusions, we would choose $\sigma_0$, $\sigma_1$, $\rho$, or interval estimates for $\beta_{i^\star} = [\beta_{i^\star 0}, \beta_{i^\star 1}]$, the parameters of a new girl who is exchangeable with those in the study. Table 8.1 gives posterior medians and 95% interval estimates for the fixed effects and variance components. We see sensitivity to the prior with respect to inference for $D$. As $r$ increases, the posterior medians draw closer to the prior means of 1.0 and 0.1. For $\beta_0$ and $\beta_1$, the medians are robust to the prior specification, while the width of the intervals for $\beta_0$ and $\beta_1$ change in proportion to the behavior of $\sigma_0^2$ and $\sigma_1^2$, respectively. The interval estimates for $\beta_0$ narrow, while those for $\beta_1$ widen, though the changes are modest. With only 11 subjects, we would expect sensitivity to the prior on $D$. For $r = 7$, the "total degrees of freedom" is 18 with a prior contribution of 7 and a data contribution of 11.

Table 8.1 Posterior medians and 95% intervals for fixed effects and variance components, under three priors for the dental growth data for girls

Parameter      Prior mean   r = 4                 r = 7                 r = 28
$\beta_0$      -            22.6 [21.4, 23.8]     22.6 [21.5, 23.7]     22.6 [21.8, 23.5]
$\beta_1$      -            0.48 [0.33, 0.63]     0.48 [0.31, 0.65]     0.48 [0.28, 0.67]
$\sigma_0^2$   1.0          3.48 [1.66, 8.75]     2.97 [1.51, 6.63]     1.78 [1.14, 2.97]
$\sigma_{01}$  0.0          0.13 [-0.10, 0.54]    0.10 [-0.14, 0.46]    0.04 [-0.10, 0.20]
$\sigma_1^2$   0.1          0.03 [0.01, 0.10]     0.05 [0.02, 0.12]     0.08 [0.05, 0.14]

The population intercept is $\beta_0$ and the population slope is $\beta_1$. The variances of the random intercepts and random slopes are $\sigma_0^2$ and $\sigma_1^2$, respectively, and the covariance between the two is $\sigma_{01}$.


8.7  Generalized  Estimating  Equations 
8.7.1  Motivation 

We  now  describe  the  GEE  approach  to  modeling/inference.  GEE  attempts  to  make 
minimal  assumptions  about  the  data-generating  process  and  is  constructed  to  answer 
population-level,  rather  than  individual-level,  questions.  There  are  some  links  with 
the  quasi-likelihood  approach  described  in  Sect.  2.5  in  that,  rather  than  specify  a 
full  probability  model  for  the  data,  only  the  first  two  moments  are  specified.  GEE 
is  motivated  by  dependent  data  situations,  however,  and  exploits  replication  across 
units  to  empirically  estimate  standard  errors  through  sandwich  estimation.  GEE 
uses  a  “working”  second  moment  assumption;  “working”  refers  to  the  choice  of  a 
variance  model  that  may  not  necessarily  correspond  to  exactly  the  form  we  believe 
to  be  true  but  rather  to  be  a  choice  that  is  statistically  convenient  (we  elaborate 
on  this  point  subsequently).  Any  discrepancies  from  the  truth  are  corrected  using 
sandwich  estimation  to  give  a  procedure  that  gives  a  consistent  estimator  of  both 
the  regression  parameters  and  the  standard  errors  (so  long  as  we  have  independence 
between  individuals). 

We assume the marginal mean model

$$E[Y_i] = x_i\beta,$$

and consider the $n_i \times n_i$ working variance-covariance matrix:

$$\mathrm{var}(Y_i) = W_i \qquad (8.38)$$

with $\mathrm{cov}(Y_i, Y_{i'}) = 0$ for $i \neq i'$, so that observations on different individuals are assumed uncorrelated. To motivate GEE, we begin by assuming that $W_i$ is known and does not depend on unknown parameters. In this case the GLS estimator minimizes

$$\sum_{i=1}^{m} (Y_i - x_i\beta)^{\mathsf T} W_i^{-1} (Y_i - x_i\beta),$$

and is given by the solution to the estimating equation

$$\sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} (Y_i - x_i\hat{\beta}) = 0,$$

which is

$$\hat{\beta} = \left( \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} x_i \right)^{-1} \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} Y_i.$$

We have $E[\hat{\beta}] = \beta$, and if the information about $\beta$ grows with increasing $m$, then $\hat{\beta}$ is consistent. The vital observation is that $\hat{\beta}$ is a consistent estimator for any fixed $W = \mathrm{diag}(W_1, \ldots, W_m)$. The weighting of observations by the latter dictates the efficiency of the estimator but not its consistency. The variance, $\mathrm{var}(\hat{\beta})$, is

$$\left( \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} x_i \right)^{-1} \left( \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1}\, \mathrm{var}(Y_i)\, W_i^{-1} x_i \right) \left( \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} x_i \right)^{-1}. \qquad (8.39)$$

If the assumed variance-covariance matrix is substituted, that is, $\mathrm{var}(Y_i) = W_i$, then we obtain the model-based variance

$$\left( \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} x_i \right)^{-1}.$$

A Gauss-Markov theorem shows that, in this case, the estimator is efficient amongst linear estimators if the variance model (8.38) is correct (Exercise 8.6). The novelty of GEE is that rather than depend on a correctly specified variance model, sandwich estimation, via (8.39), is used to repair any deficiency in the working variance model.


8.7.2 The GEE Algorithm

We now suppose that $\mathrm{var}(Y_i) = W_i(\alpha)$ where $\alpha$ are unknown parameters in the variance-covariance model. A common approach is to assume

$$W_i(\alpha) = \alpha_1 R_i(\alpha_2),$$

where $\alpha_1 = \mathrm{var}(Y_{ij})$ is the variance of the response, for all $i$ and $j$, and $R_i(\alpha_2)$ is a working correlation matrix that depends on parameters $\alpha_2$. There are a number of choices for $R_i$, including independence, exchangeable and AR(1) models (as described in Sect. 8.4.2). For known $\alpha$, $\hat{\beta}$ is the root of the estimating equation

$$G(\beta) = \sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1}(\alpha)\,(y_i - x_i\beta) = 0. \qquad (8.40)$$

When $\alpha$ is unknown, we require an estimator $\hat{\alpha}$ that converges to "something" so that, informally speaking, we have a stable weighting matrix $W(\hat{\alpha})$ in the estimating equation.

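The sandwich variance estimator is

$$\widehat{\mathrm{var}}(\hat{\beta}) = \left( \sum_{i=1}^{m} x_i^{\mathsf T} \hat{W}_i^{-1} x_i \right)^{-1} \left( \sum_{i=1}^{m} x_i^{\mathsf T} \hat{W}_i^{-1}\, \widehat{\mathrm{var}}(Y_i)\, \hat{W}_i^{-1} x_i \right) \left( \sum_{i=1}^{m} x_i^{\mathsf T} \hat{W}_i^{-1} x_i \right)^{-1} \qquad (8.41)$$

where $\hat{W}_i = W_i(\hat{\alpha})$ and $\mathrm{var}(Y_i)$ is estimated by the variance-covariance matrix of the residuals:

$$(Y_i - x_i\hat{\beta})(Y_i - x_i\hat{\beta})^{\mathsf T}. \qquad (8.42)$$

This produces a consistent estimate of $\mathrm{var}(\hat{\beta})$, so long as we have independence between units, that is, $\mathrm{cov}(Y_i, Y_{i'}) = 0$ for $i \neq i'$. It is the replication across units that produces consistency, and so, the approach cannot succeed if we have no replication. Exercise 8.12 shows that we cannot estimate $\mathrm{var}(Y)$ using the analog of (8.42) when there is dependence between units.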
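
A small sketch may help fix ideas about (8.41) and (8.42). It assumes a single regression coefficient and a working independence model $W_i = \alpha I$, in which case $\alpha$ cancels from both the estimate and the sandwich form, and every matrix product reduces to a sum. The function name and the divisor used for the model-based variance are assumptions made for illustration.

```python
import random


def gee_independence(xs, ys):
    """Scalar-coefficient fit under working independence.

    xs and ys are lists of per-unit lists. Returns the estimate, the
    model-based variance, and the sandwich variance.
    """
    bread = sum(x * x for xi in xs for x in xi)           # sum_i x_i^T x_i
    beta_hat = sum(x * y for xi, yi in zip(xs, ys)
                   for x, y in zip(xi, yi)) / bread
    resid = [[y - x * beta_hat for x, y in zip(xi, yi)]
             for xi, yi in zip(xs, ys)]
    n_total = sum(len(yi) for yi in ys)
    # Model-based variance: treats all observations as independent with a
    # common variance; the divisor n_total - 1 allows for one coefficient.
    alpha_hat = sum(e * e for ri in resid for e in ri) / (n_total - 1)
    var_model = alpha_hat / bread
    # Sandwich: x_i^T r_i r_i^T x_i is the square of the scalar x_i^T r_i.
    meat = sum(sum(x * e for x, e in zip(xi, ri)) ** 2
               for xi, ri in zip(xs, resid))
    var_sandwich = meat / bread ** 2
    return beta_hat, var_model, var_sandwich


if __name__ == "__main__":
    # Hypothetical clustered data: 50 units, 10 observations each, with a
    # shared unit-level effect that induces within-unit correlation.
    rng = random.Random(3)
    effects = [rng.gauss(0.0, 2.0) for _ in range(50)]
    xs = [[1.0] * 10 for _ in effects]
    ys = [[5.0 + b + rng.gauss(0.0, 1.0) for _ in range(10)] for b in effects]
    print(gee_independence(xs, ys))
```

With strongly correlated observations within units, as simulated here, the model-based variance is far too small, while the sandwich form uses the between-unit replication to recover a much larger and more realistic value.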

For inference, we may use the asymptotic distribution

$$\widehat{\mathrm{var}}(\hat{\beta})^{-1/2}\,(\hat{\beta} - \beta) \sim \mathrm{N}_{k+1}(0, I),$$

where we emphasize that the asymptotics are in the number of units, $m$. The variance estimator is sometimes referred to as robust, but empirical is a more appropriate description since the form can be highly unstable for small $m$.

In the most general case of working variance model specification, we may allow the working variance model to depend on $\beta$ also, so that we have $W_i(\alpha, \beta)$ to allow mean-variance relationships. For example, in a longitudinal setting, the variance may depend on the square of the marginal mean with an autoregressive covariance model:

$$\begin{aligned}
\mathrm{var}(Y_{ij}) &= \alpha_1 \mu_{ij}^2 \\
\mathrm{cov}(Y_{ij}, Y_{ik}) &= \alpha_1 \alpha_2^{|t_{ij} - t_{ik}|} \mu_{ij} \mu_{ik} \\
\mathrm{cov}(Y_{ij}, Y_{i'k'}) &= 0, \quad i \neq i'
\end{aligned}$$

with $j, k = 1, \ldots, n_i$, $k' = 1, \ldots, n_{i'}$ and where $t_{ij}$ is the time associated with response $Y_{ij}$. In this model, $\alpha_1$ is the component of the variance that does not depend on the mean (and is assumed constant across time and across individuals), $\alpha_2$ is the correlation between responses on the same individual which are one unit of time apart and $\alpha = [\alpha_1, \alpha_2]$. In general the roots of the estimating equation

$$\sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1}(\alpha, \beta)\,(Y_i - x_i\beta) = 0 \qquad (8.43)$$

are not available in closed form when $\beta$ appears in $W$.

We can write the $(k + 1) \times 1$ estimating function in a variety of forms, for example:

$$x^{\mathsf T} W^{-1} (Y - x\beta)$$

$$\sum_{i=1}^{m} x_i^{\mathsf T} W_i^{-1} (Y_i - x_i\beta)$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} \sum_{k=1}^{n_i} x_{ij}^{\mathsf T}\, W_i^{jk}\, (Y_{ik} - x_{ik}\beta)$$

where $W_i^{jk}$ denotes entry $(j, k)$ of $W_i^{-1}$. We will often use the middle form, since this emphasizes that the basic unit of replication (upon which the asymptotic properties depend) is indexed by $i$.

The GEE approach is constructed to carry out marginal inference, and so we cannot perform individual-level inference. For a linear model, marginalizing an LMM produces a marginal model identical to that used in a GEE approach. As a consequence, parameter interpretation, as discussed in Sect. 8.4.3 in the marginal setting, is identical in the LMM and in GEE. When nonlinear models are considered in Chap. 9 there is no equivalence and the differences between the conditional and marginal approaches to inference become more pronounced. For the linear model, sandwich estimation may be applied to the MLE of $\beta$.

As far as the choice of “working” correlation structure is concerned, we encounter
the classic efficiency/robustness trade-off. If we choose a simple structure,
there are few elements in $\alpha$ to estimate, but there is a potential loss of efficiency.
A more complex model may provide greater efficiency if the variance model is
closer to the true data-generating mechanism, but at the cost of more instability in
the estimation of $\alpha$. Clearly, this choice should be based on the sample size, with
relatively sparse data encouraging the use of a simple model.

We summarize the GEE approach to modeling/estimation when the working
variance model depends on $\alpha$ and not on $\beta$. The steps of the approach are:

1. Specification of a mean model, $E[Y_i] = x_i\beta$.

2. Specification of a working variance model, $\text{var}(Y_i) = W_i(\alpha)$.

3. From (1) and (2), an estimating function is constructed, and sandwich estimation
is applied to the variance of the resultant estimator.

In general, iteration is needed to simultaneously estimate $\beta$ and $\alpha$. Let $\alpha^{(0)}$ be an
initial estimate, set $t = 0$, and iterate between:

1. Solve $G(\beta, \alpha^{(t)}) = 0$, with $G$ given by (8.40), to give $\hat\beta^{(t+1)}$.

2. Estimate $\alpha^{(t+1)}$ based on $\hat\beta^{(t+1)}$.

Set $t \to t + 1$, and return to 1.
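The iteration above can be sketched in a few lines. This is a minimal illustration only, under stated assumptions: the working model is exchangeable with variance $\alpha_1$ and correlation $\alpha_2$, $G$ is taken to be the linear estimating function $\sum_i x_i^T W_i^{-1}(Y_i - x_i\beta)$, $\alpha$ is updated by simple moment estimates from the residuals, and the data are simulated. The function names are hypothetical.

```python
import numpy as np

def exchangeable_cov(n, alpha1, alpha2):
    """Working covariance W_i: variance alpha1, common correlation alpha2."""
    return alpha1 * ((1.0 - alpha2) * np.eye(n) + alpha2 * np.ones((n, n)))

def solve_beta(x_list, y_list, alpha1, alpha2):
    """Step 1: root of sum_i x_i^T W_i^{-1} (Y_i - x_i beta) = 0 for fixed alpha."""
    p = x_list[0].shape[1]
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for x_i, y_i in zip(x_list, y_list):
        W_inv = np.linalg.inv(exchangeable_cov(len(y_i), alpha1, alpha2))
        lhs += x_i.T @ W_inv @ x_i
        rhs += x_i.T @ W_inv @ y_i
    return np.linalg.solve(lhs, rhs)

def update_alpha(x_list, y_list, beta):
    """Step 2: moment estimates of (alpha1, alpha2) from the current residuals.
    Assumes at least one unit has two or more responses."""
    sq_sum = cross_sum = 0.0
    n_obs = n_pairs = 0
    for x_i, y_i in zip(x_list, y_list):
        r = y_i - x_i @ beta
        sq_sum += np.sum(r ** 2)
        cross_sum += (np.sum(r) ** 2 - np.sum(r ** 2)) / 2.0  # sum over j < k of r_j r_k
        n_obs += len(r)
        n_pairs += len(r) * (len(r) - 1) // 2
    alpha1 = sq_sum / n_obs
    return alpha1, (cross_sum / n_pairs) / alpha1

def sandwich_cov(x_list, y_list, beta, alpha1, alpha2):
    """Sandwich variance: bread^{-1} meat bread^{-1}, with unit-level replication."""
    p = x_list[0].shape[1]
    bread, meat = np.zeros((p, p)), np.zeros((p, p))
    for x_i, y_i in zip(x_list, y_list):
        W_inv = np.linalg.inv(exchangeable_cov(len(y_i), alpha1, alpha2))
        u_i = x_i.T @ W_inv @ (y_i - x_i @ beta)   # unit i's contribution to G
        bread += x_i.T @ W_inv @ x_i
        meat += np.outer(u_i, u_i)
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

def gee_fit(x_list, y_list, tol=1e-8, max_iter=100):
    alpha = (1.0, 0.0)                              # alpha^(0): working independence
    beta = solve_beta(x_list, y_list, *alpha)
    for _ in range(max_iter):
        alpha = update_alpha(x_list, y_list, beta)  # estimate alpha given beta
        beta_new = solve_beta(x_list, y_list, *alpha)
        done = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if done:
            break
    return beta, alpha, sandwich_cov(x_list, y_list, beta, *alpha)

# Simulated example (an assumption): a shared random intercept induces
# exchangeable within-unit correlation; true beta = [1, 2].
rng = np.random.default_rng(1)
m, n = 200, 4
x_list, y_list = [], []
for _ in range(m):
    x_i = np.column_stack([np.ones(n), rng.normal(size=n)])
    y_i = x_i @ np.array([1.0, 2.0]) + rng.normal() + rng.normal(size=n)
    x_list.append(x_i)
    y_list.append(y_i)

beta_hat, alpha_hat, cov_sandwich = gee_fit(x_list, y_list)
```

Because the working variance does not depend on $\beta$ here, step 1 is a weighted least squares solve in closed form; with a mean-variance relationship in $W_i$ it would itself require iteration.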
Example: Linear Regression

We illustrate the use of a working variance assumption in an independent data
situation. Suppose

$$E[Y_i] = x_i\beta,$$

for $i = 1, \dots, n$. Under the working independence variance model, $\text{var}(Y) = \alpha I$,
the OLS estimator

$$\hat\beta = (x^T x)^{-1} x^T Y$$

is recovered. The sandwich form of variance estimate is

$$\text{var}(\hat\beta) = (x^T x)^{-1} x^T \text{var}(Y) x (x^T x)^{-1}. \qquad (8.44)$$

Assuming the working variance is “true” gives the model-based estimate

$$\text{var}(\hat\beta) = (x^T x)^{-1} \alpha,$$

and $\alpha$ may be estimated by

$$\hat\alpha = \frac{1}{n - k - 1} \sum_{i=1}^{n} (Y_i - x_i\hat\beta)^2,$$

which is formally equivalent to quasi-likelihood. If we replace $\text{var}(Y)$ in (8.44)
by a diagonal matrix with diagonal elements $(Y_i - x_i\hat\beta)^2$, then we obtain a
variance estimator that protects (asymptotically) against errors with nonconstant
variance. We cannot protect against correlated outcomes, however, since there is
no replication.
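As a minimal sketch of the two variance estimates in this example, the code below computes the OLS estimator, the model-based variance $(x^Tx)^{-1}\hat\alpha$, and the sandwich form (8.44) with $\text{var}(Y)$ replaced by the diagonal matrix of squared residuals. The simulated heteroscedastic data and the function name are assumptions made for illustration.

```python
import numpy as np

def ols_with_sandwich(x, y):
    """OLS fit with model-based and sandwich variance estimates."""
    n, p = x.shape                                   # p = k + 1
    xtx_inv = np.linalg.inv(x.T @ x)
    beta_hat = xtx_inv @ x.T @ y
    resid = y - x @ beta_hat
    alpha_hat = resid @ resid / (n - p)              # 1 / (n - k - 1) * sum of squares
    cov_model = xtx_inv * alpha_hat
    meat = x.T @ (resid[:, None] ** 2 * x)           # x^T diag(resid^2) x
    cov_sandwich = xtx_inv @ meat @ xtx_inv
    return beta_hat, cov_model, cov_sandwich

# Simulated data (an assumption): the error standard deviation grows with |z|,
# so the constant-variance working model is wrong.
rng = np.random.default_rng(2)
n = 500
z = rng.uniform(-2.0, 2.0, size=n)
x = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * z + z * rng.normal(size=n)

beta_hat, cov_model, cov_sandwich = ols_with_sandwich(x, y)
se_model = np.sqrt(np.diag(cov_model))
se_sandwich = np.sqrt(np.diag(cov_sandwich))
```

In this simulated setting the sandwich standard error for the slope is larger than the model-based one, because the largest error variances occur at the most influential design points.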


8.7.3 Estimation of Variance Parameters

To formalize the estimation of $\alpha$, we may introduce a second estimating equation.
In the context of data with $\mu_{ij} = E[Y_{ij}]$ and $\text{var}(Y_{ij}) \propto v(\mu_{ij})$, we define residuals
$R_{ij} = Y_{ij} - x_{ij}\beta$. Recall that $\beta$ is a $(k+1) \times 1$ vector of parameters, and suppose
$\alpha$ is an $r \times 1$ vector of variance parameters. We then consider the pair of estimating
equations:

$$G_1(\beta, \alpha) = \sum_{i=1}^{m} x_i^T W_i^{-1}(Y_i - x_i\beta) \qquad (8.45)$$

$$G_2(\beta, \alpha) = \sum_{i=1}^{m} E_i^T H_i^{-1}[T_i - \Sigma_i(\alpha)] \qquad (8.46)$$
where the “data” in the second estimating equation are

$$T_i^T = [R_{i1}R_{i2}, \dots, R_{i,n_i-1}R_{in_i}, R_{i1}^2, \dots, R_{in_i}^2],$$

an $[n_i + n_i(n_i - 1)/2]$-dimensional vector with

$$\Sigma_i(\alpha) = E[T_i]$$

a model for the variances of, and correlations between, the residuals. In (8.46),
$E_i = \partial\Sigma_i/\partial\alpha$ is the $[n_i + n_i(n_i - 1)/2] \times r$ matrix of derivatives, and $H_i = \text{cov}(T_i)$ is the
$[n_i + n_i(n_i - 1)/2] \times [n_i + n_i(n_i - 1)/2]$ working covariance model for the squared
and cross residual terms. If $G_2$ is correctly specified, then there will be efficiency
gains. A further advantage of this approach is that it is straightforward to incorporate
a regression model for the variance-covariance parameters, that is, $\alpha = g(x)$, for
some link function $g(\cdot)$. For general $H_i$, we will require the estimation of fourth
order statistics, that is, $\text{var}(T_i)$, which is a highly unstable endeavor unless we have
an abundance of data. For this reason, working independence, $H_i = I$, is often used.

If $E[T_i] \neq \Sigma_i(\alpha)$, then we will not achieve consistent estimation of the true variance
model but, crucially, consistency of $\hat\beta$ through $G_1$ is guaranteed, so long as $\hat\alpha$
converges to “something.” We reiterate that a consistent estimate of $\text{var}(\hat\beta)$ is
guaranteed through the use of sandwich estimation, so long as units are independent.

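As an illustration of the approach, assume for simplicity $n_i = n = 3$ so that

$$T_i^T = [R_{i1}R_{i2}, R_{i1}R_{i3}, R_{i2}R_{i3}, R_{i1}^2, R_{i2}^2, R_{i3}^2].$$

With an exchangeable variance model:

$$\Sigma_i(\alpha)^T = E[T_i^T] = [\alpha_1\alpha_2, \alpha_1\alpha_2, \alpha_1\alpha_2, \alpha_1, \alpha_1, \alpha_1]$$

so that $\alpha_1$ is the marginal variance, and $\alpha_2$ is the correlation between observations
on the same unit. With $H_i = I$, that is, a working independence model for the
variance parameters, the estimating function for $\alpha$ is

$$G_2(\alpha) = \sum_{i=1}^{m}
\begin{bmatrix} \alpha_2 & \alpha_2 & \alpha_2 & 1 & 1 & 1 \\ \alpha_1 & \alpha_1 & \alpha_1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} R_{i1}R_{i2} - \alpha_1\alpha_2 \\ R_{i1}R_{i3} - \alpha_1\alpha_2 \\ R_{i2}R_{i3} - \alpha_1\alpha_2 \\ R_{i1}^2 - \alpha_1 \\ R_{i2}^2 - \alpha_1 \\ R_{i3}^2 - \alpha_1 \end{bmatrix}.$$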
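We therefore need to simultaneously solve the two equations:

$$\sum_{i=1}^{m} \left[ \alpha_2 \sum_{j<k} (R_{ij}R_{ik} - \alpha_1\alpha_2) + \sum_{j=1}^{3} (R_{ij}^2 - \alpha_1) \right] = 0$$

$$\sum_{i=1}^{m} \left[ \alpha_1 \sum_{j<k} (R_{ij}R_{ik} - \alpha_1\alpha_2) \right] = 0.$$

Dividing the second of these by $\alpha_1$ shows that

$$\hat\alpha_1\hat\alpha_2 = \frac{1}{3m} \sum_{i=1}^{m} \sum_{j<k} R_{ij}R_{ik}$$

and substituting this into the first equation gives

$$\hat\alpha_1 = \frac{1}{3m} \sum_{i=1}^{m} \sum_{j=1}^{3} R_{ij}^2$$

to yield a pair of method of moments estimators.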
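The pair of moment estimators can be verified against the estimating function directly. The sketch below, written for a general common cluster size $n$ (with $n = 3$ giving the $3m$ divisor above), computes $\hat\alpha_1$ and $\hat\alpha_2$ from an $m \times n$ array of residuals and evaluates $G_2$ with $H_i = I$ at those values. The residuals are simulated, and the function names are illustrative assumptions.

```python
import numpy as np

def cross_products(R):
    """Per-unit sum over j < k of R_ij * R_ik."""
    return (R.sum(axis=1) ** 2 - (R ** 2).sum(axis=1)) / 2.0

def exchangeable_moments(R):
    """Method of moments estimates (alpha1, alpha2) from an m x n residual array."""
    R = np.asarray(R, dtype=float)
    m, n = R.shape
    n_pairs = n * (n - 1) // 2
    alpha1 = (R ** 2).sum() / (n * m)
    alpha2 = cross_products(R).sum() / (n_pairs * m) / alpha1
    return alpha1, alpha2

def g2(R, alpha1, alpha2):
    """Estimating function for alpha under H_i = I and an exchangeable model."""
    R = np.asarray(R, dtype=float)
    m, n = R.shape
    n_pairs = n * (n - 1) // 2
    cross_dev = cross_products(R).sum() - m * n_pairs * alpha1 * alpha2
    square_dev = (R ** 2).sum() - m * n * alpha1
    return np.array([alpha2 * cross_dev + square_dev, alpha1 * cross_dev])

# Simulated residuals (an assumption): a shared term gives correlation 0.5.
rng = np.random.default_rng(3)
m = 50
R = rng.normal(size=(m, 1)) + rng.normal(size=(m, 3))

alpha1_hat, alpha2_hat = exchangeable_moments(R)
g2_at_estimates = g2(R, alpha1_hat, alpha2_hat)
```

Both components of `g2_at_estimates` are zero up to rounding, confirming that the closed-form estimators are the roots of the two equations.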


Example:  Dental  Growth  Curve 

We  use  a  GEE  approach  with  the  marginal  model: 

$$E[Y_{ij}] = x_{ij}\beta,$$

and interactions so that

$$x_{ij} = \begin{cases} [1, t_j, 0, 0] & \text{for } i = 1, \dots, 16 \\ [1, t_j, 1, t_j] & \text{for } i = 17, \dots, 27, \end{cases}$$

where $j = 1, 2, 3, 4$ and $[t_1, t_2, t_3, t_4] = [-2, -1, 1, 2]$. Table 8.2 summarizes analyses
with independence and exchangeable working correlation models, including
standard errors under the assumption that the working model is correct (the “model”
standard errors) and under sandwich estimation.

The point estimates and model-based standard errors under working independence
always correspond to those from an OLS fit. The point estimates under the two
working models are also identical here due to the balanced design. This agreement
will not hold in general. The marginal variance is estimated as 2.26, and the
correlation parameter under the exchangeable model as 0.61. These are in very close
agreement with the equivalent values of 2.28 and 0.63 obtained from the random
intercepts LMM. As we would expect for these data, the model-based and sandwich
standard errors are quite similar under the exchangeable working model, because we
have seen that the empirical estimates of the second moments are close to those of an
exchangeable correlation structure. In contrast, the working independence standard
errors change quite considerably. The sandwich standard errors are larger for the
time static intercepts and smaller for the parameters associated with time (the two
slopes).

Table 8.2 Summaries for the dental growth data of fixed effects from GEE analyses, under
independence and exchangeable working correlation matrices; $\beta_0$ and $\beta_1$ are the population
intercept and population slope for boys and $\beta_0 + \beta_2$ and $\beta_1 + \beta_3$ are the population intercept
and population slope for girls

             Independence                        Exchangeable
                        Standard error                      Standard error
             Estimate   Model    Sandwich        Estimate   Model    Sandwich
$\beta_0$    25.0       0.28     0.44            25.0       0.47     0.44
$\beta_1$    0.78       0.13     0.098           0.78       0.079    0.098
$\beta_2$    -2.32      0.44     0.75            -2.32      0.74     0.75
$\beta_3$    -0.31      0.20     0.12            -0.31      0.12     0.12

Likelihood  inference  for  a  LMM  with  random  intercepts  and  slopes  produced 
identical  point  estimates  to  those  in  Table  8.2  and  standard  errors  of  [0.49,  0.086, 
0.76,  0.14],  which  are  in  reasonable  agreement  with  the  sandwich  standard  errors 
reported  in  the  table. 


Example:  Dental  Data,  Reduced  Dataset 

In the dental example the balanced design and relative abundance of data lead to
summaries  that  might  suggest  that  the  alternative  methods  we  have  described  are 
always  in  complete  agreement.  To  correct  this  illusion,  we  now  report  summaries 
from  an  artificially  created  dental  growth  curve  data  set  in  which  it  is  assumed  that 
children  randomly  drop  out  of  the  study  at  some  point  after  the  first  measurement. 
This  yielded  the  data  in  Fig.  8.3  with  39  measurements  on  boys  (previously  there 
were  64)  and  25  on  girls  (previously  there  were  44). 

We  analyze  these  data  using  GEE  and  LMMs,  the  latter  via  likelihood  and 
Bayesian  approaches  to  inference.  For  GEE,  we  implement  independence  and 
exchangeable  working  correlation  structures.  Table  8.3  gives  point  estimates  along 
with  uncertainty  measures.  For  GEE,  we  report  sandwich  standard  errors,  for  the 
likelihood  LMM  model-based  standard  errors  and  for  the  Bayes  LMM  posterior 
(model-based)  standard  deviations.  The  posterior  distributions  for  the  regression 
parameters  were  close  to  normal,  with  interval  estimates  based  on  a  normal 
approximation  virtually  identical  to  those  based  directly  on  samples  from  the 
posterior. For the Bayesian analysis, we used a flat prior on $\beta$, and the Wishart
prior for $D^{-1}$ had prior mean (8.37), with $r = 4$.
Fig. 8.3 Distance versus age for reduced dental data


Table 8.3 Summaries for the reduced dental growth data of fixed effects from GEE under
independent and exchangeable working correlation matrices and likelihood and Bayesian LMMs;
$\beta_0$ and $\beta_1$ are the population intercept and population slope for boys and $\beta_0 + \beta_2$ and $\beta_1 + \beta_3$ are
the population intercept and population slope for girls

             GEE independence   GEE exchangeable   LMM likelihood   LMM Bayesian
             Est.     s.e.      Est.     s.e.      Est.     s.e.    Est.     s.d.
$\beta_0$    24.9     0.75      24.8     0.63      24.7     0.65    24.8     0.63
$\beta_1$    0.77     0.20      0.71     0.11      0.70     0.14    0.70     0.16
$\beta_2$    -2.70    1.23      -2.01    0.97      -1.92    1.04    -1.98    1.02
$\beta_3$    -0.53    0.27      -0.21    0.15      -0.17    0.23    -0.19    0.26

For these data, none of the analyses are completely satisfactory since the small
number of observations does not give confidence in the sandwich standard errors,
nor are the data sufficiently abundant to allow any reliable evaluation of assumptions
for the LMM analyses. The exchangeable standard errors appear too small for the
slope parameters, though the point estimates are in reasonable agreement with their
LMM counterparts. The GEE independence standard errors are more in line with
the LMM analyses, though the point estimates are quite different for $\beta_2$ and $\beta_3$.
As expected under these priors, the likelihood and Bayes analyses are in reasonable
agreement.


8.8  Assessment  of  Assumptions 
8.8.1  Review  of  Assumptions 

Each of the approaches to modeling that we have described depends, to a varying
degree,  upon  assumptions.  To  ensure  that  inference  is  accurate,  we  need  to  check 
that  these  assumptions  are  at  least  approximately  valid.  We  begin  by  reviewing  the 
assumptions,  starting  with  GEE  (since  it  depends  on  the  fewest  assumptions). 

For GEE, we have the marginal mean model:

$$Y_i = x_i\beta + e_i$$

and working covariance $\text{var}(e_i) = W_i(\alpha)$, $i = 1, \dots, m$. The first consideration
is whether the marginal model $E[Y_i] = x_i\beta$ is appropriate. In particular, one
must  check  whether  the  model  requires  refinement  by,  for  example,  the  addition  of 
quadratic  terms  or  interactions.  We  may  also  examine  whether  additional  variables, 
such  as  confounders,  are  required  in  the  model.  These  considerations  are  common 
to  all  approaches.  If  the  mean  model  is  inadequate,  but  all  other  assumptions  are 
satisfied,  then  we  will  still  have  a  consistent  estimator  of  the  assumed  form,  but 
the  relevance  of  inference  is  open  to  question.  For  example,  suppose  the  true 
relationship  is  quadratic,  but  we  incorrectly  assume  a  linear  model.  The  linear 
association  will  still  be  consistently  estimated  but  may  be  very  misleading.  Deciding 
on  a  course  of  action  if  the  mean  model  is  inadequate  depends  on  the  nature 
of  the  analysis.  If  we  are  in  exploratory  mode,  then  fitting  different  models  is 
not  problematic.  But  if  we  are  in  confirmatory  mode,  then  we  would  want  to 
minimize  changes  to  the  model,  though  knowing  of  inadequacies  is  important  for 
interpretation. 

The  use  of  a  sandwich  estimate  for  the  standard  errors  is  reliable  in  the  sense 
of  giving  consistent  estimates  regardless  of  whether  the  working  covariance  model 
mimics  the  truth,  but  a  working  model  that  is  far  from  the  truth  will  lead  to  a  loss 
of  efficiency  (so  that  the  standard  errors  are  bigger  than  they  need  to  be),  which 
suggests  one  should  examine  whether  the  assumed  working  model  is  far  from  that 
suggested by the data. In addition, if the number of units $m$ is not large, then the
estimate  of  the  sandwich  standard  errors  could  be  very  unstable,  and  asymptotic 
inference  may  be  inappropriate.  As  usual,  there  is  no  easy  recipe  for  deciding 
whether  m  is  “sufficiently  large”,  since  this  depends  on  the  design  across  individuals 
in  the  sample.  The  decision  may  be  based  on  simulation,  though  experience  with 
similar  datasets  is  beneficial. 

For the LMM, the usual model is

$$Y_i = x_i\beta + z_i b_i + \epsilon_i,$$

with $b_i \mid D \sim_{iid} N_{q+1}(0, D)$, $\epsilon_i \mid \sigma_\epsilon^2 \sim_{ind} N_{n_i}(0, \sigma_\epsilon^2 I)$, and $b_i$, $\epsilon_i$ independent,
$i = 1, \dots, m$. This leads to the marginal model $Y_i \mid \beta, \alpha \sim N_{n_i}(x_i\beta, V_i)$ and
estimator

$$\hat\beta = (x^T V^{-1} x)^{-1} x^T V^{-1} Y \qquad (8.47)$$

with

$$(x^T V^{-1} x)^{1/2} (\hat\beta - \beta) \sim N_{k+1}(0, I).$$

Therefore, if $m$ is large, we do not require the data or the random effects to be
normally distributed since the estimator is linear in the data, and so we can appeal to
a central limit theorem. For an accurate standard error, we require the model-based
form of the variance to be close to the truth, however. It is particularly important
that there are no unmodeled mean-variance relationships. Another key requirement
is that the random effects arise from a common distribution. Often, unit-specific
covariates will be available, and these may define subpopulations with different
random effects distributions (e.g., differing variance-covariance matrices $D$).
If $m$ is small we require, in addition, the data to be “close to normal”
for valid inference. Sandwich estimation can be easily applied to obtain an empirical
standard error, keeping in mind the caveats expressed above with regard to the need
for sufficiently large $m$.

For  prediction  of  the  random  effects,  we  have  seen  that  the  BLUP  estimator  is 
optimal  under  a  number  of  different  criteria.  Normality  of  the  random  effects  or  the 
errors  is  not  required,  though  an  appropriate  variance  model  is  again  important. 

A Bayesian analysis of the LMM adds hyperpriors for $\beta$ and $\alpha$ to the two-stage
likelihood model. Each of the modeling assumptions required for likelihood-based
inference is needed for a Bayesian analysis. However, asymptotic inference is not
needed if, for example, MCMC is used. Accurate inference requires checking of
the first and second stage assumptions because inference relies on the model being
correct (or in practice, close to correct). Also, thought is required when priors are
specified because inference may well be sensitive to the choices made. In particular,
care is called for in the specification for $D$. We emphasize that normality of the data
and the random effects is not needed for a valid analysis if the sample size is large.
For example, for inference with respect to $\beta$, the posterior for $\beta$ will be accurate so
long as the asymptotic distribution of the estimator, (8.47), is faithful. Essentially,
the asymptotic distribution replaces the likelihood contribution to the posterior.


8.8.2 Approaches to Assessment

For  those  individuals  with  sufficient  data,  individual-specific  models  may  be  fitted  to 
allow  examination  of  the  appropriateness  of  initially  hypothesized  models  in  terms 
of  the  linear  component  and  assumptions  about  the  errors,  such  as  constant  variance, 
serial  correlation,  and  normality  if  m  is  small.  Following  the  fitting  of  marginal  or 
mixed  models,  the  assumptions  may  then  be  assessed  further,  with  examination  of 
residuals  a  useful  exercise. 

Residuals may be defined with respect to different levels. With respect to the
usual LMM, a vector of unstandardized population-level (marginal) residuals is

$$e_i = Y_i - x_i\beta$$

and these are most useful for analyses based on the marginal (GEE) approach.
A vector of unstandardized unit-level (stage one) residuals is

$$\epsilon_i = Y_i - x_i\beta - z_i b_i.$$

The vector of random effects $b_i$ is also a form of (stage two) residual. Estimated
versions of these residuals are

$$\hat e_i = Y_i - x_i\hat\beta \qquad (8.48)$$

$$\hat\epsilon_i = Y_i - x_i\hat\beta - z_i\hat b_i \qquad (8.49)$$

and $\hat b_i$, $i = 1, \dots, m$.

We  first  discuss  the  population  residuals  (8.48).  Recall,  from  consideration  of 
the  ordinary  linear  model  (Sect.  5.11),  that  estimated  residuals  have  dependencies 
induced  by  replacement  of  parameters  by  their  estimates.  The  situation  is  far 
worse  for  dependent  data  because  we  would  expect  the  population  residuals  to  be 
dependent, even if the true parameter values were known. If $V_i(\alpha)$ is the true error
structure, then

$$\text{var}(e_i) = V_i \quad \text{and} \quad \text{var}(\hat e_i) \approx V_i(\hat\alpha),$$

showing the dependence of the residuals under the model. This means that, when
working with $\hat e_i$, it is difficult to check whether the covariance model is correctly
specified. Plotting $\hat e_{ij}$ versus the $l$th covariate $x_{ijl}$, $l = 1, \dots, k$, may also be
misleading due to the dependence within the residuals. Therefore, standardization
is essential to remove the dependence.

Let $\hat V_i = L_i L_i^T$ be the Cholesky decomposition of $\hat V_i = V_i(\hat\alpha)$. We can use this
decomposition to form

$$\hat e_i^* = L_i^{-1} \hat e_i = L_i^{-1}(Y_i - x_i\hat\beta)$$

so that $\text{var}(\hat e_i^*) \approx I_{n_i}$. We may then work with the model

$$Y_i^* = x_i^*\beta + e_i^*$$

where $Y_i^* = L_i^{-1} Y_i$, $x_i^* = L_i^{-1} x_i$, and $e_i^* = L_i^{-1} e_i$. Plots of $\hat e_{ij}^*$ against $x_{ijl}^*$,
$l = 1, \dots, k$ should not show systematic patterns if the assumed linear form is
correct.
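The standardization step is easy to carry out numerically. The following is a minimal sketch for a single unit, assuming a fitted covariance matrix and coefficient estimate are already available; the numerical values of `V_i` and `beta_hat` below are hypothetical placeholders, and the response is simulated.

```python
import numpy as np

def standardize_unit(Y_i, x_i, V_i, beta_hat):
    """Cholesky-standardized residuals and transformed data for one unit."""
    L = np.linalg.cholesky(V_i)                     # V_i = L L^T, L lower triangular
    e_star = np.linalg.solve(L, Y_i - x_i @ beta_hat)
    Y_star = np.linalg.solve(L, Y_i)
    x_star = np.linalg.solve(L, x_i)
    return e_star, Y_star, x_star, L

rng = np.random.default_rng(4)
n_i = 4
x_i = np.column_stack([np.ones(n_i), np.array([-2.0, -1.0, 1.0, 2.0])])
V_i = 2.0 * (0.4 * np.eye(n_i) + 0.6 * np.ones((n_i, n_i)))   # hypothetical fitted V_i
beta_hat = np.array([25.0, 0.8])                              # hypothetical estimate
Y_i = x_i @ beta_hat + rng.multivariate_normal(np.zeros(n_i), V_i)

e_star, Y_star, x_star, L = standardize_unit(Y_i, x_i, V_i, beta_hat)

# L^{-1} V_i L^{-T} should be the identity, so the transformed errors are uncorrelated.
L_inv = np.linalg.inv(L)
whitened_cov = L_inv @ V_i @ L_inv.T
```

Stacking `e_star` and the columns of `x_star` across units gives the quantities to plot against each other when checking the linear form.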

QQ plots of $\hat e_{ij}^*$ versus the expected residuals from a normal distribution can
be used to assess normality (unless $m$ is small, normal errors are not required
for accurate inference, but the closer to normality are the data, the smaller the $m$
required for the asymptotics to have practically “kicked in”). If the $e_i$ are normal,
then standardized residuals will be normally distributed also, since $e_i^*$ is a linear
combination of elements of $e_i$.

The correctness of the mean-variance relationship can be assessed by plotting
$\hat e_{ij}^{*2}$ (or $|\hat e_{ij}^*|$) against fitted values $\hat\mu_{ij}^* = x_{ij}^*\hat\beta$. Any systematic (non-horizontal)
trends suggest problems. Local smoothers (as described in Chap. 11) can be added
to plots to aid interpretation and plotting symbols such as unit or observation number
can also be useful to identify collections of observations for which the model is not
adequate.

For the LMM with $\epsilon_i \mid \sigma_\epsilon^2 \sim_{ind} N_{n_i}(0, \sigma_\epsilon^2 I)$, the stage one residuals (8.49) may
be formed. Standardized versions are $\hat\epsilon_i^* = \hat\epsilon_i / \hat\sigma_\epsilon$. As usual, these residuals may
be plotted against covariates. One may construct normal QQ plots, though a correct
mean-variance relationship is more influential than lack of normality (so long as the
sample size is not small). The constant variance assumption may be examined via a
plot of $\hat\epsilon_{ij}^{*2}$ (or $|\hat\epsilon_{ij}^*|$) versus $\hat\mu_{ij} = x_{ij}\hat\beta + z_{ij}\hat b_i$.

Recall the model

$$Y_i = x_i\beta + z_i b_i + \delta_i + \epsilon_i, \qquad (8.50)$$

introduced in Sect. 8.4.2, with $b_i \mid D \sim_{iid} N_{q+1}(0, D)$ and $\epsilon_i \mid \sigma_\epsilon^2 \sim_{ind}
N_{n_i}(0, \sigma_\epsilon^2 I)$ representing random effects and measurement error and $\delta_{ij}$ being
zero-mean normal error terms with serial dependence in time. A simple and
commonly used form for serial dependence is the AR(1) model (also described in
Sect. 8.4.2) which gives covariances

$$\text{cov}(\delta_{ij}, \delta_{ik}) = \sigma_\delta^2 \rho^{|t_{ij} - t_{ik}|} = \sigma_\delta^2 R_{ijk}.$$

Conditional on $b_i$, this leads to the variance-covariance for responses on unit $i$:

$$\text{var}(Y_i \mid b_i) = V_i = \sigma_\delta^2 R_i + \sigma_\epsilon^2 I_{n_i}. \qquad (8.51)$$

If model (8.50) is fitted, then residuals of the form (8.49) may be formed, but
these should be standardized in the same way as just described for population
residuals (i.e., using the decomposition $V_i = L_i L_i^T$) since they will have marginal
variance (8.51).

In  a  temporal  setting,  one  may  want  to  detect  whether  serial  correlation  is  present 
in  the  residuals.  Two  tools  for  such  detection  are  the  autocorrelation  function  and  the 
semi-variogram.  We  describe  the  autocorrelation  function  and  the  semi-variogram 
generically  with  respect  to  the  model 


$$Y_t = \mu_t + e_t,$$

for $t = 1, \dots, n$. We assume the error terms $e_t$ are second-order stationary, which
means that $E[e_t] = \mu$ is constant, and $\text{cov}(e_t, e_{t+d}) = C(d)$, where $d \geq 0$, that
is, the covariance only depends on the temporal spacing between the variables.
This implies that the variance of $e_t$ is constant, and equal to $C(0)$, for all $t$. The
autocorrelation function (ACF) is defined, for time points $d \geq 0$ apart, as

$$\rho(d) = \frac{\text{cov}(e_t, e_{t+d})}{\sqrt{\text{var}(e_t)\text{var}(e_{t+d})}} = \frac{C(d)}{C(0)}$$

for all $t$. Now, suppose we have estimates of the errors $\hat e_t$ for responses equally
spaced over time, which we label as $t = 1, \dots, n$. The empirical ACF is defined as

$$\hat\rho(d) = \frac{\hat C(d)}{\hat C(0)} = \frac{\sum_{t=1}^{n-d} \hat e_t \hat e_{t+d} / (n - d)}{\hat C(0)},$$

for $d = 0, 1, \dots, n - 1$. A correlogram plots $\hat\rho(d)$ versus $d$ for $d = 0, 1, 2, \dots, n - 1$.
If the residuals are a white noise process (i.e., uncorrelated), then asymptotically

$$\sqrt{n}\, \hat\rho(d) \to_d N(0, 1),$$

for $d = 1, 2, \dots$, to give, for example, 95% confidence bands of $\pm 1.96/\sqrt{n}$.
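A minimal sketch of the empirical ACF and the white-noise bands follows. It uses the definition above (no mean subtraction, since the residuals are treated as zero mean) and is applied to two simulated series, one white noise and one AR(1) with coefficient 0.6; both series and the function name are illustrative assumptions.

```python
import numpy as np

def empirical_acf(e, max_lag):
    """Empirical ACF: C_hat(d) / C_hat(0), with C_hat(d) averaging n - d products."""
    e = np.asarray(e, dtype=float)
    n = len(e)
    c_hat = np.array([np.sum(e[:n - d] * e[d:]) / (n - d) for d in range(max_lag + 1)])
    return c_hat / c_hat[0]

rng = np.random.default_rng(5)
n = 2000

# White noise residuals: the ACF at lags d >= 1 should lie mostly inside the bands.
white = rng.normal(size=n)
acf_white = empirical_acf(white, max_lag=10)

# AR(1) residuals with coefficient 0.6, started from the stationary distribution.
ar = np.empty(n)
ar[0] = rng.normal() / np.sqrt(1.0 - 0.6 ** 2)
for t in range(1, n):
    ar[t] = 0.6 * ar[t - 1] + rng.normal()
acf_ar = empirical_acf(ar, max_lag=10)

band = 1.96 / np.sqrt(n)        # 95% band under white noise
```

A correlogram is then a plot of `acf_white` or `acf_ar` against lag, with horizontal lines at `band` and `-band`; the AR(1) series shows values well outside the bands at short lags.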

We now turn to a description of the semi-variogram, a tool which was introduced
by Matheron (1971) in the context of spatial analysis (more specifically, geostatistics)
and is described in the context of longitudinal data by Diggle et al. (2002,
Chap. 3.4). Define the semi-variogram of the residuals $e_t$ as

$$\gamma(d) = \frac{1}{2}\text{var}(e_t - e_{t+d}) = \frac{1}{2}E\left[(e_t - e_{t+d})^2\right]$$

for $d \geq 0$. The reason for the 1/2 term will soon become apparent. The semi-variogram
exists under weaker conditions than the ACF, specifically under intrinsic
stationarity, which means that $e_t$ has constant mean and $\text{var}(e_t - e_{t+d})$ only depends
on $d$ (so that the covariance need not be defined). For zero-mean error terms and
under second-order stationarity,

$$\gamma(d) = \frac{1}{2}\text{var}(e_t) + \frac{1}{2}\text{var}(e_{t+d}) - \text{cov}(e_t, e_{t+d})$$

$$= C(0) - C(d)$$

$$= C(0)[1 - \rho(d)].$$
Suppose  we  now  have  estimated  errors  q,  along  with  associated  times  ti,  l  = 
1 , ,n.  The  sample  semi-variogram  uses  the  empirical  halved  squared  differences 
between  pairs  of  residuals 

vw  =  ~  ^')2’ 
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Fig.  8.4  Theoretical 
(semi- jvariogram 
corresponding  to  (8.53)  with 
=  1,  ff|  =  4  and  p  =  0.3 
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along  with  the  spacings  dw  =  \ti  —  £p |  for  l  =  1 ,n  and  l  <  l'  =  1, . . . ,  n. 
With  irregular  sampling  times,  the  variogram  can  be  estimated  from  the  pairs 
(dii1 ,  vu/),  with  the  resultant  plot  being  smoothed.2  An  example  of  such  a  plot  is 
given  in  Fig.  8.9.  Under  normality  of  the  data,  the  marginal  distribution  of  each  vw 
is  C( 0)xi,  and  this  large  variability  can  make  the  variogram  difficult  to  interpret.  In 
addition,  because  each  residual  contributes  to  n  —  1  terms  in  the  empirical  cloud  of 
points,  the  points  are  not  independent,  and  a  single  outlying  point  can  influence  the 
plot  at  different  time  lags. 
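The construction of the variogram cloud can be sketched as follows. The function `variogram_cloud` forms all pairs $(d_{ll'}, v_{ll'})$ with $l < l'$; the binned average in `binned_mean` is a deliberately crude stand-in for a smoother and is an assumption of this sketch, as are the toy times and residuals.

```python
import numpy as np

def variogram_cloud(times, resid):
    """All pairs (d, v) with d = |t_l - t_l'| and v = 0.5 * (e_l - e_l')^2, l < l'."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(resid, dtype=float)
    l, lp = np.triu_indices(len(t), k=1)        # index pairs with l < l'
    return np.abs(t[l] - t[lp]), 0.5 * (e[l] - e[lp]) ** 2

def binned_mean(d, v, edges):
    """Average v within bins of d; bin b covers edges[b-1] <= d < edges[b]."""
    which = np.digitize(d, edges)
    centers, means = [], []
    for b in range(1, len(edges)):
        mask = which == b
        if mask.any():                          # skip empty bins
            centers.append(0.5 * (edges[b - 1] + edges[b]))
            means.append(v[mask].mean())
    return np.array(centers), np.array(means)

# Toy example with three irregularly spaced residuals.
times = [0.0, 1.0, 3.0]
resid = [1.0, -1.0, 2.0]
d, v = variogram_cloud(times, resid)
centers, means = binned_mean(d, v, edges=np.array([0.5, 1.5, 2.5, 3.5]))
```

With $n$ residuals the cloud contains $n(n-1)/2$ points, which is why each residual appears in $n - 1$ of them and a single outlier can affect several lags at once.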

Suppose now we are in a longitudinal setting, in which response $Y_{ij}$ is observed
at time $t_{ij}$, and we fit the LMM

$$Y_i = x_i\beta + z_i b_i + \epsilon_i, \qquad (8.52)$$

with the usual forms for $b_i$ and $\epsilon_i$. After fitting, we form the stage one residuals
(8.49), that is, $\hat\epsilon_{ij} = Y_{ij} - x_{ij}\hat\beta - z_{ij}\hat b_i$. We might believe the serial dependence
takes the same form across individuals. For equally spaced times, we can examine
the empirical ACF of the residuals where, for simplicity, we assume that there are $n$
responses on each of the $m$ individuals,

$$\hat\rho(d) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n-d} \hat\epsilon_{ij}\hat\epsilon_{i,j+d} / (n - d)}{\sum_{i=1}^{m} \sum_{j=1}^{n} \hat\epsilon_{ij}^2 / n},$$

for $d = 0, 1, \dots, n - 1$.


2. For unequally spaced times, the longitudinal data literature often recommends the construction
of  the  empirical  semi-variogram  (Diggle  et  al.  2002,  Sect.  3.4;  Fitzmaurice  et  al.  2004,  Sect.  9.4), 
though  one  could  construct  and  smooth  the  empirical  covariance  function  in  a  similar  fashion. 
Now suppose that we again fit model (8.52), and we have $n_i$ responses for
individual $i$ with sampling times $t_{ij}$. We then define the semi-variogram for the
$i$th individual as

$$\gamma_i(d_{ijk}) = \frac{1}{2}E\left[(\epsilon_{ij} - \epsilon_{ik})^2\right]$$

where $d_{ijk} = |t_{ij} - t_{ik}|$. We now form

$$v_{ijk} = \frac{1}{2}(\hat\epsilon_{ij} - \hat\epsilon_{ik})^2$$

and the semi-variogram can then be estimated by plotting the pairs $(d_{ijk}, v_{ijk})$ for
$i = 1, \dots, m$ and $j < k = 1, \dots, n_i$ and smoothing. If no serial dependence is
present, the smoother should be roughly horizontal.

Consider  the  interpretation  of  the  variogram  when  model  (8.50)  is  the  “truth,” 
but  suppose  we  fit  a  LMM  without  the  autocorrelated  terms.  We  consider  stage  one 
residuals,  which  under  (8.50)  will  take  the  form 


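$$\epsilon_{ij}^\star = \delta_{ij} + \epsilon_{ij}.$$

For differences in residuals on the same individual,

$$\epsilon_{ij}^\star - \epsilon_{ik}^\star = \delta_{ij} + \epsilon_{ij} - \delta_{ik} - \epsilon_{ik}$$

$$= (\delta_{ij} - \delta_{ik}) + (\epsilon_{ij} - \epsilon_{ik})$$

and so, since the $\delta$ and $\epsilon$ terms are independent and the cross term therefore has
zero expectation, the semi-variogram takes the form

$$\gamma_i(d_{ijk}) = \frac{1}{2}E\left[(\epsilon_{ij}^\star - \epsilon_{ik}^\star)^2\right]$$

$$= \frac{1}{2}E\left[(\delta_{ij} - \delta_{ik})^2 + (\epsilon_{ij} - \epsilon_{ik})^2\right]$$

$$= \sigma_\delta^2[1 - \rho(d_{ijk})] + \sigma_\epsilon^2. \qquad (8.53)$$

As $d_{ijk} \to 0$, $\gamma_i(d_{ijk}) \to \sigma_\epsilon^2$. The rate at which the asymptote $\sigma_\delta^2 + \sigma_\epsilon^2$ is reached as
$d_{ijk} \to \infty$ is determined by $\rho$. This variogram is illustrated in Fig. 8.4.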
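The form (8.53) can be checked by simulation. The sketch below assumes the AR(1) correlation $\rho(d) = \rho^d$ at integer lags, and uses illustrative values $\sigma_\delta^2 = 4$, $\sigma_\epsilon^2 = 1$, and $\rho = 0.3$ chosen for this example only; it compares the formula with the average halved squared difference of simulated $\delta_{ij} + \epsilon_{ij}$ terms.

```python
import numpy as np

def theoretical_variogram(d, sigma2_delta, sigma2_eps, rho):
    """Semi-variogram (8.53) with AR(1) correlation rho(d) = rho ** d."""
    d = np.asarray(d, dtype=float)
    return sigma2_delta * (1.0 - rho ** d) + sigma2_eps

# Illustrative parameter values (assumptions of this sketch).
sigma2_delta, sigma2_eps, rho = 4.0, 1.0, 0.3

rng = np.random.default_rng(6)
m, n = 20000, 6                                   # units and equally spaced times 0, ..., 5

# Stationary AR(1) series delta_ij with marginal variance sigma2_delta.
delta = np.empty((m, n))
delta[:, 0] = rng.normal(scale=np.sqrt(sigma2_delta), size=m)
for j in range(1, n):
    innovation = rng.normal(scale=np.sqrt(sigma2_delta * (1.0 - rho ** 2)), size=m)
    delta[:, j] = rho * delta[:, j - 1] + innovation

# Stage one residuals under (8.50): serial term plus measurement error.
resid = delta + rng.normal(scale=np.sqrt(sigma2_eps), size=(m, n))

lags = np.arange(1, n)
empirical = np.array([np.mean(0.5 * (resid[:, d:] - resid[:, :n - d]) ** 2) for d in lags])
theoretical = theoretical_variogram(lags, sigma2_delta, sigma2_eps, rho)
```

The intercept at lag zero is $\sigma_\epsilon^2$ and the curve rises toward $\sigma_\delta^2 + \sigma_\epsilon^2$; with $\rho = 0.3$ the plateau is essentially reached within a few lags.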

We  now  briefly  consider  the  use  of  population  residuals,  starting  with  the  random 
intercepts  model: 

$$Y_{ij} = x_{ij}\beta + b_i + \delta_{ij} + \epsilon_{ij},$$

with $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$ and the AR(1) model for $\delta_{ij}$. The population residuals
under this model are

$$e_{ij} = Y_{ij} - x_{ij}\beta = b_i + \delta_{ij} + \epsilon_{ij},$$

$i = 1, \dots, m$; $j = 1, \dots, n_i$. For differences in residuals on the same individual,

$$e_{ij} - e_{ik} = b_i + \delta_{ij} + \epsilon_{ij} - b_i - \delta_{ik} - \epsilon_{ik}$$

$$= (\delta_{ij} - \delta_{ik}) + (\epsilon_{ij} - \epsilon_{ik}),$$

and so we obtain the same semi-variogram, (8.53), as before. Since $b_i$ is constant
for individual $i$, its variance does not appear.

In  general,  the  variogram  is  limited  in  its  use  for  population  residuals  for 
the  LMM,  as  we  now  illustrate.  Consider  the  LMM  with  random  intercepts  and 
independent  random  slopes: 

$$b_{i0} \mid D_0 \sim N(0, D_0), \quad b_{i1} \mid D_1 \sim N(0, D_1).$$

This leads to marginal variance

$$\text{var}(Y_{ij}) = \sigma_\epsilon^2 + D_0 + D_1 t_{ij}^2,$$


which  is  not  constant  over  time.  Therefore,  a  semi-variogram  of  population  residuals 
should  not  be  constructed,  because  we  do  not  have  second-order  stationarity. 
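The dependence of the marginal variance on time can be seen by building the marginal covariance $z_i D z_i^T + \sigma_\epsilon^2 I$ for one unit. This is a small sketch with hypothetical parameter values; only the structure (independent random intercept and slope) is taken from the text.

```python
import numpy as np

def marginal_cov(times, D0, D1, sigma2_eps):
    """Marginal covariance z D z^T + sigma2_eps * I for independent random
    intercepts (variance D0) and random slopes (variance D1)."""
    t = np.asarray(times, dtype=float)
    z = np.column_stack([np.ones_like(t), t])       # random intercept and slope design
    D = np.diag([D0, D1])
    return z @ D @ z.T + sigma2_eps * np.eye(len(t))

# Hypothetical values, chosen only to illustrate the non-constant variance.
times = np.array([-2.0, -1.0, 1.0, 2.0])
V = marginal_cov(times, D0=1.0, D1=0.5, sigma2_eps=2.0)
marginal_var = np.diag(V)                           # sigma2_eps + D0 + D1 * t^2
```

Here `marginal_var` equals `[5.0, 3.5, 3.5, 5.0]`, so the variance is larger at the extreme times and the second-order stationarity needed for a variogram of population residuals fails.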

Predictions of the random effects $\hat b_i$ may be used to assess assumptions associated
with the random effects distribution, though since these have undergone
shrinkage, they may be deceptive. One may instead carry out individual fitting
and then use the resultant estimates to assess the normality assumption. The latter
may be assessed via QQ plots, but the interpretation of plots requires care since
estimates and not observed quantities are being plotted; see Lange and Ryan (1989).
We may also assess whether the variance of the random effects is independent
of covariates $x_i$. If the spread of the random effects distribution depends on
the  levels  of  covariates,  and  this  is  missed,  then  inaccurate  inference  can  result 
(Heagerty  and  Kurland  2001).  For  the  LMM,  it  is  better  to  examine  stage  one  and 
stage  two  residuals  separately,  rather  than  population  residuals,  since  the  latter  are 
a  mixture  of  the  two,  and  so,  if  something  appears  amiss,  it  is  difficult  to  determine 
the  stage  at  which  the  inadequacy  is  occurring.  As  usual,  as  discussed  in  Sect.  4.9, 
the  implications  of  changing  the  model  should  be  carefully  considered,  and  one 
should  avoid  the  temptation  to  model  every  nuance  of  the  data. 


Example:  FEV1  Over  Time 

The  dental  data  that  have  formed  our  running  illustration  are  balanced,  and  there 
are  few  individuals  and  time  points,  and  so,  these  data  are  not  ideal  for  illustrating 
model  checking.  Hence,  we  introduce  data  from  an  epidemiological  study  described 
by  van  der  Lende  et  al.  (1981).  We  analyze  a  sample  of  133  men  and  women, 
initially  aged  15-44,  from  the  rural  area  of  Vlagtwedde  in  the  Netherlands.  Study 
participants  were  followed  over  time  to  obtain  information  on  the  prevalence  of, 
and risk factors for, chronic obstructive lung diseases. These data were previously
analyzed by Fitzmaurice et al. (2004). Follow-up surveys provided information on
respiratory symptoms and smoking status. Pulmonary function was measured by
spirometry, and a measure of forced expiratory volume (FEV1) was obtained every
3 years for the first 15 years of the study and also at year 19. Each study participant
was either a current or a former smoker, with current smoking defined as smoking at
least one cigarette per day. In this dataset, FEV1 was not recorded for every subject
at each of the planned measurement occasions so that the number of measurements
of FEV1 on each subject varied between 1 and 7. Table 8.4 shows the numbers of
observations available at each time point. There are 32 former smokers and 101
current smokers in total, and we see that the numbers with missing observations at
each time point are not drastically different.

Table 8.4 Mean FEV1 (and sample size) by smoking status and time

Time    Former smoker    Current smoker
0       3.52 (23)        3.23 (85)
3       3.58 (27)        3.12 (95)
6       3.26 (28)        3.09 (89)
9       3.17 (30)        2.87 (85)
12      3.14 (29)        2.80 (81)
15      2.87 (24)        2.68 (73)
19      2.91 (28)        2.50 (74)

Fig. 8.5 Mean FEV1 profiles versus time for two smoking groups

Figure  8.5  plots  the  mean  FEV1  profiles  versus  time  for  former  smokers  (solid 
line)  and  current  smokers  (dashed  line).  It  is  clear  that  there  is  a  difference  in 
the  overall  level,  with  former  smokers  having  higher  responses.  Whether  the  rate 
of  decline  in  FEV1  is  different  in  the  two  groups  is  not  so  obvious.  Figure  8.6 
plots  the  individual  trajectories  versus  time  for  former  smokers  (solid  lines)  and 
current  smokers  (dashed  lines).  There  is  clearly  large  between-individual  variability 
in levels so that observations on the same individual will be correlated.

Fig. 8.6  FEV1 versus time for 133 individuals; former and current smokers are indicated by solid
and dashed lines, respectively


Let $Y_{ij}$ represent the FEV1 on individual $i$ at time (from baseline) $t_{ij}$ (in
years), with $S_i = 0/1$ indicating former/current smoker. We treat this example as
illustrative only and therefore fit various models to examine the effects on inference
and to demonstrate model assessment and comparison. We initially fit the following
three models using REML:

$$Y_{ij} = \beta_0 + \beta_1 t_{ij} + b_i + \epsilon_{ij} \tag{8.54}$$

$$Y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 S_i + b_i + \epsilon_{ij} \tag{8.55}$$

$$Y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 S_i + \beta_3 S_i \times t_{ij} + b_i + \epsilon_{ij} \tag{8.56}$$

with $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$ and $\epsilon_{ij} \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2)$ and with $\epsilon_{ij}$ and $b_i$
independent, $i = 1, \dots, m$, $j = 1, \dots, n_i$. We emphasize that the random effect
distribution is assumed common to both former and current smokers. Estimates and
standard errors for $\beta_1$, $\beta_2$, and $\beta_3$ are given in Table 8.5. We include an ordinary
least squares (OLS) fit of the model $E[Y_{ij}] = \beta_0 + \beta_1 t_{ij} + \beta_2 S_i + \beta_3 S_i \times t_{ij}$.
This model is clearly inappropriate since it assumes independent observations but,
when compared to the equivalent LMM, (8.56), illustrates that the standard errors of
the estimates corresponding to time-varying covariates (time $\beta_1$ and the interaction
$\beta_3$) are reduced under the LMM. This behavior occurs because within-individual
comparisons are more efficient in a longitudinal study (as discussed in Sect. 8.3).

Table 8.5  Results of various LMM analyses and an ordinary least squares (OLS) fit to the
FEV1 data

Model               β1 (Time)   s.e.     β2 (Smoke)   s.e.    β3 (Inter)   s.e.
LMM TIME            -0.037      0.0013   -            -       -            -
LMM TIME + SMOKE    -0.037      0.0013   -0.31        0.11    -            -
LMM TIME x SMOKE    -0.034      0.0026   -0.27        0.11    -0.0046      0.0030
OLS TIME x SMOKE    -0.038      0.0067   -0.31        0.085   -0.00041     0.0077

Table 8.6  Results of LMM (likelihood and Bayesian) and GEE analyses for the FEV1 data

Model                   β1 (Time)   s.e.     β2 (Smoke)   s.e.   σ0
Likelihood LMM          -0.037      0.0013   -0.31        0.11   0.53
Bayes LMM               -0.037      0.0013   -0.31        0.12   0.53
GEE                     -0.037      0.0015   -0.31        0.11   -
Likelihood LMM AR(1)    -0.037      0.0013   -0.31        0.11   0.53

σ0 is the standard deviation of the random intercepts


To compare the three LMMs in Table 8.5, we must use MLE for likelihood ratio
tests, since the data are not constant under the different models under REML (due
to different $\beta$, Sect. 8.5.3). For

$H_0$ : Model (8.54) versus $H_1$ : Model (8.55)

we have a likelihood ratio statistic of 8.22 on 1 degree of freedom and a p-value of
0.0042. Hence, there is strong evidence to reject the null, and we conclude that there
are differences in intercepts for former and current smokers (as we suspected from
Fig. 8.5). For

$H_0$ : Model (8.55) versus $H_1$ : Model (8.56)

we have a likelihood ratio statistic of 2.29 on 1 degree of freedom and a p-value of
0.13. Hence, under conventional levels of significance, there is no reason to reject
the null, and we conclude that the interaction is not needed, so that the decline in
FEV1 with time is the same for both former and current smokers.
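
The two p-values quoted above can be checked from the reported likelihood ratio statistics. The sketch below assumes only that the reference distribution is chi-squared on 1 degree of freedom, for which the upper tail probability at x equals erfc(sqrt(x/2)); the function name `chi2_1_pvalue` is illustrative and does not come from the text.

```python
import math


def chi2_1_pvalue(lr_stat):
    """Upper-tail probability of a chi-squared(1) variable at lr_stat."""
    # If Z ~ N(0, 1) then Z^2 ~ chi-squared(1), so
    # P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lr_stat / 2.0))


# Likelihood ratio statistics reported in the text
p_smoke = chi2_1_pvalue(8.22)  # model (8.54) versus model (8.55)
p_inter = chi2_1_pvalue(2.29)  # model (8.55) versus model (8.56)

print(f"smoking intercept: p = {p_smoke:.4f}")
print(f"interaction:       p = {p_inter:.2f}")
```

Small differences in the last digit relative to the text are to be expected, since the statistics are themselves reported to only two decimal places.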

We now report a Bayesian analysis of model (8.55) with improper flat priors
on $\beta_0, \beta_1, \beta_2$, the improper prior $\pi(\sigma_\epsilon^2) \propto \sigma_\epsilon^{-2}$, and $\sigma_0^{-2} \sim \mathrm{Ga}(0.5, 0.02)$. The latter
prior gives 95% of its mass for $\sigma_0$, the standard deviation of the between-individual
intercepts, between 0.09 and 6.5. Table 8.6 gives the results, which are very similar
to those of the likelihood-based approach, which is reassuring.

We  now  fit  the  marginal  model  version  of  (8.55)  using  GEE.  We  use  an 
exchangeable  correlation  structure,  since  clearly  we  have  dependence  between 
measurements  on  the  same  individual  at  different  times,  but  the  exact  form  of 
the  correlation  is  not  clear.  The  results  are  given  in  Table  8.6  and  again  show 
good agreement for the regression coefficients. In the exchangeable correlation
structure, there are two components to $\alpha$: a marginal variance, $\alpha_1$, and a common
marginal correlation, $\alpha_2$. The model may be compared to the random intercepts
model in which we have marginal variance $\alpha_1 = \sigma_\epsilon^2 + \sigma_0^2$ and marginal correlation
$\alpha_2 = \sigma_0^2/(\sigma_0^2 + \sigma_\epsilon^2)$. From the GEE analysis, $\hat{\alpha}_1 = 0.31$ and $\hat{\alpha}_2 = 0.82$ to give
$\sqrt{\hat{\alpha}_1 \times \hat{\alpha}_2} = 0.50$, which is comparable to the estimate of $\hat{\sigma}_0 = 0.53$ in Table 8.6.

Fig. 8.7  Stage one residual plots for the FEV1 data: (a) normal QQ plot, (b) residuals versus time,
(c) residuals as a function of smoking status (0 = former smoker, 1 = current smoker), (d) absolute
value of residuals versus fitted values
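
The correspondence between the exchangeable GEE parameters and the random intercepts model can be worked through numerically. This is a small sketch using the two GEE estimates quoted in the text; the implied residual standard deviation is a by-product of the same two equations and is not a value reported in the text.

```python
import math

alpha1_hat = 0.31  # GEE estimate of the marginal variance
alpha2_hat = 0.82  # GEE estimate of the common marginal correlation

# Random intercepts model:
#   alpha1 = sigma_eps^2 + sigma_0^2
#   alpha2 = sigma_0^2 / (sigma_0^2 + sigma_eps^2)
# so sigma_0^2 = alpha1 * alpha2 and sigma_eps^2 = alpha1 * (1 - alpha2).
sigma0 = math.sqrt(alpha1_hat * alpha2_hat)
sigma_eps = math.sqrt(alpha1_hat * (1.0 - alpha2_hat))

print(f"implied sigma_0   = {sigma0:.2f}")
print(f"implied sigma_eps = {sigma_eps:.2f}")
```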

We  now  examine  the  assumptions  of  the  various  approaches.  We  focus  on 
the  linear  model  that  includes  time  and  smoking  (but  no  interaction).  Figure  8.7 
summarizes  the  stage  one  residuals: 

$$e_{ij} = y_{ij} - x_{ij}\hat{\beta} - z_{ij}\hat{b}_i.$$

Panel (a) shows that the distribution of the errors is symmetric but heavier tailed than
normal. With such a large sample, there is nothing troubling in this plot, and there
are no outlying points. Panels (b) and (c) plot the residuals against time and smoking
status. We see no nonlinear behavior in the time plot and no great divergence
from constant variance in either plot. A very important assumption in mixed
effects modeling is that a common random effects distribution across covariates
is appropriate. To examine this assumption, separate analyses were carried out for
former and current smokers. The estimates of the variance components for former
smokers were $\hat{\sigma}_\epsilon = 0.22$ and $\hat{\sigma}_0 = 0.58$ and, for current smokers, $\hat{\sigma}_\epsilon = 0.21$ and
$\hat{\sigma}_0 = 0.51$. The differences between estimates in the two groups are small, and we
conclude that a common random effects distribution is reasonable. Panel (d) plots
the absolute value of the residuals versus the fitted values $x_{ij}\hat{\beta} + \hat{b}_i$, along with
a smoother. If the variance function is correctly specified, then we should see no
systematic pattern. Here, there is nothing to be too concerned about since there is
only a slight increase in variability as the mean increases. These residual plots are
based on residuals from the likelihood analysis (the Bayesian versions are similar).

Fig. 8.8  Normal QQ plots of OLS estimates for the FEV1 data for (a) intercepts, (b) slopes, and
(c) scatterplot of pairs of least squares estimates

For  the  132  individuals  with  more  than  a  single  response,  individual  OLS  fits 
were  performed.  Figure  8.8  shows  normal  QQ  plots  of  the  intercept  and  slope 
parameter  estimates  in  panels  (a)  and  (b)  and  a  bivariate  scatter  plot  of  the  pairs  of 
estimates  in  panel  (c).  The  estimates  look  remarkably  normal,  at  least  in  (a)  and  (b), 
and  there  are  no  outlying  individuals. 

Finally, we examine the residuals for serial correlation. Figure 8.9 gives the
semi-variogram of the stage one residuals along with a smoother and indicates some
evidence of dependence. In panel (a), the pattern is not apparent, but in panel (b),
the semi-variance axis is reduced for clarity, which allows the trend to be more
clearly seen. Consequently, we fit an AR(1) model to the residuals (Sect. 8.4.2),
using restricted maximum likelihood, and obtain the parameter estimates in the
last row of Table 8.6. This model is a significant improvement over the non-serial
correlation model (as measured by a likelihood ratio test, p = 0.0002). However,
there is virtually no change in the estimates/standard errors, since the AR correlation
parameter is just 0.20, with an asymptotic 95% confidence interval of [0.087, 0.30].

Fig. 8.9  For the FEV1 data: (a) the (semi)-variogram of stage one residuals, (b) on a truncated
semi-variogram scale
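
A minimal sketch of how an empirical semi-variogram of residuals could be computed is given here. It assumes the usual definition, in which each within-individual pair of residuals contributes half the squared difference at time lag |t_ij - t_ik|, and it averages within each distinct lag as a simple stand-in for the smoother used in the figure. The function name and the toy residuals are hypothetical; they are not the FEV1 data.

```python
from collections import defaultdict


def empirical_semivariogram(times, residuals):
    """Average half squared residual difference at each within-individual lag.

    times, residuals: dicts mapping an individual id to equal-length lists.
    Returns a dict {lag: mean of 0.5 * (e_ij - e_ik)^2 over pairs at that lag}.
    """
    half_sq = defaultdict(list)
    for ind in times:
        t, e = times[ind], residuals[ind]
        # Only pairs of observations on the same individual are used
        for j in range(len(t)):
            for k in range(j + 1, len(t)):
                lag = abs(t[j] - t[k])
                half_sq[lag].append(0.5 * (e[j] - e[k]) ** 2)
    return {lag: sum(v) / len(v) for lag, v in sorted(half_sq.items())}


# Hypothetical stage one residuals for two individuals
times = {1: [0, 3, 6], 2: [0, 3]}
residuals = {1: [0.1, -0.1, 0.3], 2: [0.2, 0.0]}
gamma = empirical_semivariogram(times, residuals)
print(gamma)
```

Under serial correlation the semi-variogram tends to increase with lag, since residuals close together in time are more alike than residuals far apart.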

We  may  also  examine  whether  random  slopes  are  required.  Fitting  this  model  via 
restricted likelihood gave a standard deviation of $\hat{\sigma}_1 = 0.0099$. The likelihood ratio
test statistic for correlated random intercepts and slopes, versus random intercepts
only, is 10.9, which is significant at around the 0.0025 level (where the distribution
under the null is a mixture of $\chi^2_1$ and $\chi^2_2$ distributions, see Sect. 8.5.2).
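
The significance level quoted for the random slopes test can be reproduced approximately. The sketch below assumes equal (50:50) weights on the chi-squared(1) and chi-squared(2) components of the null mixture; the weights are an assumption here, and Sect. 8.5.2 should be consulted for the precise statement. The function name is illustrative.

```python
import math


def mixture_pvalue(lr_stat):
    """P-value under an assumed 50:50 mixture of chi-squared(1) and chi-squared(2)."""
    p_chi2_1 = math.erfc(math.sqrt(lr_stat / 2.0))  # chi-squared(1) upper tail
    p_chi2_2 = math.exp(-lr_stat / 2.0)             # chi-squared(2) upper tail
    return 0.5 * p_chi2_1 + 0.5 * p_chi2_2


p_slopes = mixture_pvalue(10.9)
print(f"random slopes: p = {p_slopes:.4f}")
```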

In  terms  of  the  fixed  effects,  there  is  little  sensitivity  to  the  assumed  random 
effects  structure.  Inference  under  the  random  intercepts  and  slopes  models  is  similar 
to  the  random  intercepts  only  model,  since  the  between-individual  variability  in 
slopes is small (though statistically significant). The population change in FEV1 is
a drop of 0.037 liters per year, with a standard error of 0.0013-0.0015 depending on the
model. The posterior median for the intraindividual correlation, $\sigma_0^2/(\sigma_\epsilon^2 + \sigma_0^2)$, is
0.84 with 95% interval [0.82, 0.89], suggesting that the majority of the variability in
FEV1 is between individuals.


8.9  Cohort  and  Longitudinal  Effects 

We now describe another benefit of longitudinal studies, the ability to estimate both
longitudinal and cohort effects. We frame the discussion around the modeling of
Y = FEV1 as a function of age. We might envisage that FEV1 changes as age
increases within an individual and that individuals may have different baseline levels
of FEV1 due to "cohort" effects. A birth cohort is a group of individuals who were
born in the same year. Cohort effects may include the effects of environmental
pollutants and differences in lifestyle choices or medical treatment received. In a
cross-sectional study, a group of individuals are measured at a single time point. A
great advantage of longitudinal studies, as compared to cross-sectional studies, is
that both cohort and aging (longitudinal) effects may be estimated.

Fig. 8.10  Three population (cohort) trajectories over time

Fig. 8.11  Relationship between cross-sectional and longitudinal effects in a hypothetical example
with three populations. The dashed line (which is the top line) represents the cross-sectional slope

As an illustration, Fig. 8.10 shows the trajectories of three hypothetical individuals
as a function of calendar time. The starting positions are different due to cohort
effects. Figure 8.11 shows the same individuals but with trajectories plotted versus
age.  The  cross-sectional  association,  which  would  result  from  observing  the  final 
measurement  only,  is  highlighted  and  displays  a  steeper  decline  than  seen  in  the 
longitudinal slope.

To examine the issues in more detail, consider the model

$$E[Y_{ij} \mid x_{ij}, x_{i1}] = \beta_0 + \beta_C x_{i1} + \beta_L (x_{ij} - x_{i1})$$

where $Y_{ij}$ is the $j$th FEV1 measurement on individual $i$ and $x_{ij}$ is the age of the
individual at occasion $j$, with $x_{i1}$ being the age on a certain day (so that all the
individuals are comparable). At the first occasion,

$$E[Y_{i1} \mid x_{i1}] = \beta_0 + \beta_C x_{i1},$$

so that $\beta_C$ is the average change in response between two populations who differ by
one unit in their baseline ages. Said another way, we are examining the differences
in FEV1 between two birth cohorts a year apart, so that $\beta_C$ is the cohort effect.
Since

$$E[Y_{ij} \mid x_{ij}, x_{i1}] - E[Y_{i1} \mid x_{i1}] = \beta_L (x_{ij} - x_{i1})$$

it is evident that $\beta_L$ is the longitudinal effect, that is, the change in the average FEV1
between two populations who are in the same birth cohort and whose ages differ by
1 year. The usual cross-sectional model is

$$E[Y_{ij} \mid x_{ij}] = \beta_0 + \beta_1 x_{ij} \tag{8.57}$$

$$\phantom{E[Y_{ij} \mid x_{ij}]} = \beta_0 + \beta_1 x_{i1} + \beta_1 (x_{ij} - x_{i1})$$

so that the model implicitly assumes equal longitudinal and cohort effects, that
is, $\beta_1 = \beta_C = \beta_L$.

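In the cross-sectional study with model (8.57),

$$\hat{\beta}_1 = \frac{\sum_{i=1}^m \sum_{j=1}^{n_i} (x_{ij} - \bar{x})(Y_{ij} - \bar{Y})}{\sum_{i=1}^m \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2}$$

with $\bar{x} = \frac{1}{N}\sum_{i=1}^m \sum_{j=1}^{n_i} x_{ij}$ and $\bar{Y} = \frac{1}{N}\sum_{i=1}^m \sum_{j=1}^{n_i} Y_{ij}$ with $N = \sum_{i=1}^m n_i$. The
expected value of this estimator is

$$E[\hat{\beta}_1] = \beta_L + \frac{\sum_{i=1}^m n_i (x_{i1} - \bar{x}_1)(\bar{x}_i - \bar{x})}{\sum_{i=1}^m \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2}\,(\beta_C - \beta_L) \tag{8.58}$$

(Exercise 8.15), where $\bar{x}_i$ is the mean age of individual $i$ over the $n_i$ occasions and
$\bar{x}_1$ is the mean of the baseline ages $x_{i1}$, so that the estimate is a combination of
cohort and longitudinal effects. The cross-sectional regression model will give an
unbiased estimate of the longitudinal association if $\beta_L = \beta_C$ or if $\{x_{i1}\}$ and $\{\bar{x}_i\}$ are
orthogonal. To conclude, longitudinal studies can be powerfully employed to separate
cohort and longitudinal effects.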
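
The mixing of cohort and longitudinal effects in the cross-sectional slope can be checked numerically. The sketch below uses the fact that the least squares slope is linear in the responses, so its expectation is the same slope formula applied to the mean responses. It compares that direct calculation with the closed form in which the expected slope equals the longitudinal effect plus a design-dependent weight times the difference between the cohort and longitudinal effects. The ages and coefficient values are hypothetical, chosen only for illustration.

```python
def expected_cross_sectional_slope(ages, beta0, beta_c, beta_l):
    """Return (direct, formula) versions of E[beta1_hat].

    ages[i][j] is the age of individual i at occasion j; ages[i][0] is baseline.
    """
    xs = [x for row in ages for x in row]
    # Mean response under the cohort/longitudinal model
    mu = [beta0 + beta_c * row[0] + beta_l * (x - row[0])
          for row in ages for x in row]
    xbar = sum(xs) / len(xs)
    sxx = sum((x - xbar) ** 2 for x in xs)

    # OLS slope applied to the mean responses
    direct = sum((x - xbar) * m for x, m in zip(xs, mu)) / sxx

    # Closed form: beta_L + weight * (beta_C - beta_L)
    x1bar = sum(row[0] for row in ages) / len(ages)
    num = sum(len(row) * (row[0] - x1bar) * (sum(row) / len(row) - xbar)
              for row in ages)
    formula = beta_l + (num / sxx) * (beta_c - beta_l)
    return direct, formula


# Hypothetical unbalanced design: three individuals with different baseline ages
ages = [[30, 33, 36], [40, 43], [50, 53, 56, 59]]
direct, formula = expected_cross_sectional_slope(ages, 4.0, -0.05, -0.03)
print(direct, formula)
```

When the cohort and longitudinal effects are set equal, both quantities reduce to that common value, which matches the condition for unbiasedness given in the text.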



8.10  Concluding  Remarks 

In  this  chapter  we  have  described  two  approaches  to  fitting  linear  models  to 
dependent  data:  LMMs  and  GEE.  GEE  has  the  fewest  assumptions  and  is  designed 
for  population-level  inference.  Asymptotics  are  required  for  inference,  and  so,  GEE 
is  less  appealing  when  the  number  of  individuals  m  is  small.  A  sufficiently  large 
sample  size  is  required  for  both  normality  of  the  estimator  and  reliability  of  the 
sandwich  variance  estimator.  The  use  of  the  sandwich  variance  estimator  makes 
GEE  the  most  dependable  method  in  large  sample  situations.  However,  there  can  be 
losses  in  efficiency  if  we  choose  a  working  correlation  matrix  that  is  far  from  reality. 
With  GEE,  it  is  not  possible  to  make  inference  for  individuals  or  incorporate  prior 
information. 

LMMs  are  more  flexible  than  GEE  in  terms  of  the  questions  that  can  be 
addressed  with  the  data,  but  this  flexibility  comes  at  the  price  of  a  greater  number 
of assumptions. For likelihood inference, as with GEE, we require the number of
units to be sufficiently large for asymptotic inference. Prior information cannot
be  incorporated  in  a  likelihood  analysis;  for  that,  we  need  a  Bayesian  approach.  For 
a  small  number  of  individuals,  a  Bayesian  approach  fully  captures  the  uncertainty, 
but  inference  is  completely  model-based,  and  with  a  small  number  of  individuals,  it 
is  unlikely  that  we  will  be  able  to  check  the  modeling  assumptions. 


8.11  Bibliographic  Notes 

For  descriptions  of  linear  mixed  effects  models,  see  Hand  and  Crowder  (1996, 
Chap. 5), Diggle et al. (2002, Sects. 4.4 and 4.5), and Verbeke and Molenberghs
(2000). Covariance models are described in Verbeke and Molenberghs (2000,
Chap.  10),  Pinheiro  and  Bates  (2000,  Chap.  5),  and  Diggle  et  al.  (2002,  Chap.  5). 
Demidenko  (2004)  provides  theory  for  mixed  models,  including  the  linear  case. 
Robinson  (1991)  provides  an  interesting  discussion  of  BLUP  estimates.  Two  early 
influential  references  on  the  LMM  from  Bayesian  and  likelihood  perspectives, 
respectively,  are  Lindley  and  Smith  (1972)  and  Laird  and  Ware  (1982). 

The  name  GEE  was  coined  by  Liang  and  Zeger  (1986)  and  Zeger  and  Liang 
(1986).  See  also  Gourieroux  et  al.  (1984)  who  considered  sandwich  estimation  for 
regression  parameters  with  a  consistent  estimator  of  additional  parameters.  Prentice 
(1988) introduced a second estimating equation for estimation of $\alpha$. Crowder (1995)
points out that the existence of the $\alpha$ parameters in the working covariance matrix is
not  guaranteed,  in  which  case  the  asymptotics  break  down.  Fitzmaurice  et  al.  (2004) 
is an excellent practical text on longitudinal modeling.


8.12  Exercises

8.1 A Gauss-Markov Theorem for Dependent Data: Suppose $E[Y] = x\beta$ and
$\mathrm{var}(Y) = V$, with $Y = [Y_1, \dots, Y_m]^T$ and where $Y_i = [Y_{i1}, \dots, Y_{in_i}]^T$
and $x = [x_1, \dots, x_m]^T$ is $N \times (k + 1)$ with $x_i = [x_{i1}, \dots, x_{in_i}]$, $x_{ij} =
[1, x_{ij1}, \dots, x_{ijk}]^T$, $N = \sum_{i=1}^m n_i$ and $\beta$ is the $(k + 1) \times 1$ vector of regression
coefficients.

Consider linear estimators of the form

$$\hat{\beta}_W = (x^T W^{-1} x)^{-1} x^T W^{-1} Y,$$

where $W$ is symmetric and positive definite. Show that:

(a) $E[\hat{\beta}_W] = \beta$.

(b) $\mathrm{var}(\hat{\beta}_V) \le \mathrm{var}(\hat{\beta}_W)$.

[Hint: In (b), show that $\mathrm{var}(\hat{\beta}_W) - \mathrm{var}(\hat{\beta}_V)$ is positive semi-definite.]

8.2  Consider  the  data  in  Table  5.4  (from  Davies  1967)  that  were  presented  in 
Sect.  5.8.1.  These  data  consist  of  the  yield  in  grams  from  six  randomly  chosen 
batches  of  raw  material,  with  five  replicates  each.  The  aim  of  this  experiment 
was  to  find  out  to  what  extent  batch-to-batch  variation  was  responsible  for 
variation  in  the  final  product  yield. 

One possibility for a model for these data is the one-way analysis of variance
with

$$Y_{ij} = \mu + b_i + \epsilon_{ij},$$

with $j = 1, \dots, n$ replicates on $i = 1, \dots, m$ batches, $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$,
$\epsilon_{ij} \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2)$, with $b_i$ and $\epsilon_{ij}$ independent.

In what follows the following identity is useful. Let $I_n$ denote the $n \times n$
identity matrix and $J_n$ the $n \times n$ matrix of 1's. Then

$$(a I_n + b J_n)^{-1} = \frac{1}{a}\left(I_n - \frac{b}{a + nb} J_n\right), \quad a \ne 0, \; a \ne -nb,$$

and

$$|a I_n + b J_n| = a^{n-1}(a + nb).$$

(a) Derive the log-likelihood for $\mu$, $\sigma_0^2$, $\sigma_\epsilon^2$.

(b) Differentiate the log-likelihood, and show that the MLEs are

$$\hat{\mu} = \bar{y}_{\cdot\cdot},$$

$$\hat{\sigma}_\epsilon^2 = \mathrm{MSE},$$

$$\hat{\sigma}_0^2 = \frac{(1 - 1/m)\,\mathrm{MSA} - \mathrm{MSE}}{n},$$

where $\mathrm{MSA} = n \sum_{i=1}^m (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2/(m - 1)$ and $\mathrm{MSE} = \sum_{i=1}^m \sum_{j=1}^n
(y_{ij} - \bar{y}_{i\cdot})^2/[m(n - 1)]$.

[Hint: Life is easier if the model is parameterized in terms of $\lambda =
\sigma_\epsilon^2 + n\sigma_0^2$.]

(c) Obtain the form of $\mathrm{var}(\hat{\mu})$, and give an estimator of this quantity.

(d) Find the REML estimators of $\sigma_0^2$ and $\sigma_\epsilon^2$.

(e) In the one-way random effects model with balanced data, it can be
shown that

$$\frac{\mathrm{MSA}/(n\sigma_0^2 + \sigma_\epsilon^2)}{\mathrm{MSE}/\sigma_\epsilon^2}$$

follows the F distribution on $m - 1$ and $m(n - 1)$ degrees of freedom. Use
this result to explain why $F = \mathrm{MSA}/\mathrm{MSE}$ may be compared with an
$F_{m-1,\,m(n-1)}$ distribution to provide a test of $H_0 : \sigma_0^2 = 0$.

(f) Using the last part, show that the probability that the REML estimator of
$\sigma_0^2$ is negative is the probability that an $F_{m(n-1),\,(m-1)}$ random variable is
bigger than $1 + n\sigma_0^2/\sigma_\epsilon^2$.

(g) Numerically obtain an MLE, with associated standard error, for $\mu$.
Additionally, find ML and REML estimates of $\sigma_0^2$ and $\sigma_\epsilon^2$.

(h)  Confirm  these  estimates  using  a  statistical  package. 
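
As a hedged illustration of the quantities in this exercise, the sketch below computes MSA, MSE and the closed-form estimates for a balanced one-way layout. It takes the ML formula from part (b) and, as an assumption not derived here, uses the standard balanced-data REML form (MSA - MSE)/n for the between-batch variance. The data are hypothetical and are not the yield data of Table 5.4; the function name is illustrative.

```python
def one_way_variance_components(y):
    """y: list of m batches, each a list of n replicates (balanced design)."""
    m, n = len(y), len(y[0])
    batch_means = [sum(row) / n for row in y]
    grand_mean = sum(batch_means) / m

    # Between-batch and within-batch mean squares
    msa = n * sum((bm - grand_mean) ** 2 for bm in batch_means) / (m - 1)
    mse = sum((v - bm) ** 2
              for row, bm in zip(y, batch_means) for v in row) / (m * (n - 1))

    return {
        "mu": grand_mean,
        "msa": msa,
        "mse": mse,
        # Either estimate can be negative; in practice it is set to zero
        "sigma0_sq_ml": ((1 - 1 / m) * msa - mse) / n,
        "sigma0_sq_reml": (msa - mse) / n,
    }


# Hypothetical data: 3 batches with 3 replicates each
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
est = one_way_variance_components(toy)
print(est)
```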

8.3 Consider the so-called Neyman-Scott problem (previously considered in
Exercises 2.6 and 3.3) in which

$$Y_{ij} \mid \mu_i, \sigma^2 \sim_{ind} N(\mu_i, \sigma^2)$$

for $i = 1, \dots, n$, $j = 1, 2$.

(a) Obtain the MLE for $\sigma^2$, and show that it is inconsistent. Why are there
problems here?

(b) Consider a REML approach. Assign an improper uniform prior to $\mu_1, \dots,
\mu_n$, and integrate out these parameters. Obtain the REML estimator of $\sigma^2$, and show
that it is an unbiased estimator.

8.4  Derive  (8.25)  and  (8.26). 

[Hint:  The  identities 


are useful. These follow from

$$(E + F)^{-1} E = I - (E + F)^{-1} F$$

$$(G + E F E^T)^{-1} = G^{-1} - G^{-1} E (E^T G^{-1} E + F^{-1})^{-1} E^T G^{-1},$$

respectively.]

8.5  Show  that 

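$$\mathrm{var}(\hat{b}_i - b_i) = \mathrm{var}(b_i) - \mathrm{var}(\hat{b}_i) = D - \mathrm{var}(\hat{b}_i)$$

$$= D - D z_i^T V_i^{-1} z_i D + D z_i^T V_i^{-1} x_i (x^T V^{-1} x)^{-1} x_i^T V_i^{-1} z_i D.$$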

8.6 Consider the class of linear predictors $b^*(y) = a + By$, where $a$ and $B$ are
constants of dimensions $(q + 1) \times 1$ and $(q + 1) \times n$. Let $W = b - By$, and
show that

$$E[(b^* - b)^T A (b^* - b)] = [a - E(W)]^T A [a - E(W)] + \mathrm{tr}[A\,\mathrm{var}(W)].$$

Deduce that this expression is minimized by taking $a = -Bx\beta$ and $B =
Dz^T V^{-1}$. Hence, show that

$$Dz^T V^{-1}(y - x\beta)$$

is the best linear predictor of $b$, whatever the distributions of $b$ and $y$.

8.7 Prove that if the prior distribution for $\theta^T = [\theta_1, \dots, \theta_m]$ can be written as

$$p(\theta) = \int \prod_{i=1}^m p(\theta_i \mid \phi)\, p(\phi)\, d\phi,$$

then the covariances $\mathrm{cov}(\theta_i, \theta_j)$ are all nonnegative.

[Hint: You may assume that $E[\theta_i \mid \phi] = E[\theta_j \mid \phi]$ for $i \ne j$.]

8.8  We  return  to  the  yield  data  of  Exercise  8.2. 

(a) Numerically evaluate the formula

$$\hat{b}_i = E[b_i \mid y_i] = D z_i^T V_i^{-1}(y_i - x_i\hat{\beta})$$

in your favorite package, and obtain predictions for the yield data.

(b) Obtain measures of the variability of the prediction via $\mathrm{var}(\hat{b}_i - b_i)$.

(c) Confirm your predictions using LMM software.

8.9 A Bayesian analysis of the yield data of Exercise 8.2 will now be performed.
In terms of the parameters $\beta_0$, $\sigma_\epsilon^2$, and $\lambda = \sigma_\epsilon^2 + n\sigma_0^2$, the likelihood is

$$p(y \mid \beta_0, \sigma_\epsilon^2, \lambda) = (2\pi)^{-nm/2} (\sigma_\epsilon^2)^{-m(n-1)/2} \lambda^{-m/2}
\times \exp\left\{ -\frac{1}{2}\left[ \frac{nm(\bar{y}_{++} - \beta_0)^2}{\lambda} + \frac{SS_B}{\lambda} + \frac{SS_W}{\sigma_\epsilon^2} \right] \right\}$$

where

$$SS_B = n \sum_{i=1}^m (\bar{y}_{i+} - \bar{y}_{++})^2, \qquad SS_W = \sum_{i=1}^m \sum_{j=1}^n (y_{ij} - \bar{y}_{i+})^2.$$

Assume the improper prior

$$\pi(\beta_0, \sigma_\epsilon^2, \lambda) \propto \frac{1}{\sigma_\epsilon^2 \lambda}.$$

(a) Integrate $\beta_0$ from the joint posterior $p(\beta_0, \sigma_\epsilon^2, \lambda \mid y)$ to obtain $p(\sigma_\epsilon^2, \lambda \mid y)$.
Show that this distribution has the form of a product of independent
inverse gamma distributions with an additional term that is due to the
constraint $\lambda > \sigma_\epsilon^2 > 0$.

(b) Obtain the distribution of $p(\beta_0 \mid \sigma_\epsilon^2, \lambda, y)$.

(c) Give details of a composition algorithm (as described in Sect. 3.8.4) for
simulating from the posterior $p(\beta_0, \sigma_\epsilon^2, \lambda \mid y)$.

(d) Implement the algorithm for the yield data.

(i) Give histograms and 5%, 50%, 95% quantile summaries of the univariate
posterior distributions for $\beta_0$, $\sigma_\epsilon^2$, $\lambda$, $\sigma_0^2$, and $\rho = \sigma_0^2/(\sigma_0^2 + \sigma_\epsilon^2)$.

(ii) Obtain a bivariate scatterplot representation of the posterior distribution
$p(\sigma_\epsilon^2, \sigma_0^2 \mid y)$.

(iii) Using samples from the distribution for $\rho$, answer the original
question concerning the extent of batch-to-batch variability that is
contributing to the total variability.

(e) Obtain the distribution of $p(b_i \mid \beta_0, \sigma_\epsilon^2, \sigma_0^2, y)$. Hence, describe an
algorithm for simulating from the posterior $p(b_i \mid y)$. Implement this
algorithm for the yield data, and give 5%, 50%, 95% quantile summaries
for $p(b_i \mid y)$, $i = 1, \dots, m$.

(f) Now, consider an alternative computational approach assuming independent
priors with an improper flat prior on $\mu$, the improper prior $\pi(\sigma_\epsilon^2) \propto
\sigma_\epsilon^{-2}$, and a Ga(0.05, 0.01) prior for $\sigma_0^{-2}$. Implement a Gibbs sampling
algorithm for sampling from the conditionals:

•  $\mu \mid \sigma_\epsilon^2, \sigma_0^2, b, y$
•  $\sigma_\epsilon^2 \mid \mu, \sigma_0^2, b, y$
•  $\sigma_0^2 \mid \mu, \sigma_\epsilon^2, b, y$
•  $b \mid \mu, \sigma_\epsilon^2, \sigma_0^2, y$, where $b = [b_1, \dots, b_m]^T$

Report posterior medians and 90% credible intervals for $\mu$, $\sigma_\epsilon^2$, $\sigma_0^2$, $b$,
and $\rho$, and compare your answers with those using the alternative priors
derived in the earlier part of the question.

8.10 Derive the conditional distributions, given in Sect. 8.6.3, that are required for
Gibbs sampling in the LMM.

8.11 Show the equivalence of the BLUP predictor $\hat{b}_i$ and the Gibbs conditional
distribution $b_i \mid y, \beta, \alpha$.

8.12  Consider  the  tooth  growth  data  that  were  analyzed  in  this  chapter.  These  data 
are available in the R package nlme as Orthodont. Let $Y_{ij}$ denote the
growth (in mm) at occasion $t_j$ (in years) for boy $i$, $i = 1, \dots, m$, $j = 1, \dots, 4$,
with $t_1 = 8$, $t_2 = 10$, $t_3 = 12$, $t_4 = 14$.

(a) Code up a GEE algorithm with working independence in your favorite
package, and report $\hat{\beta}$ and $\widehat{\mathrm{var}}(\hat{\beta})$.

(b)  Using  an  available  option  in  a  statistical  package  such  as  R  confirm  the 
results  of  the  previous  part. 

(c)  Show  that  var(Y)  =  (Y  —  x(3)T(Y  —  x/3)  =  0  if  we  attempt  to  use 
sandwich  estimation  in  the  situation  in  which  cov(Y.  Y )  /  0. 

8.13 In this question the effect of using different correlation structures, designs,
and sample sizes in the GEE approach will be examined. Let $Y_{ij}$ represent the
observed growth on individual $i$ at time $x_{ij}$, $i = 1, \dots, m$; $j = 1, \dots, n_i$. Let
$N = \sum_{i=1}^m n_i$.

Assume the marginal model is

$$E[Y_{ij}] = \beta_0 + \beta_1 x_{ij},$$

so that $E[Y] = x\beta$ where $Y$ is of dimension $N \times 1$, $x$ is $N \times 2$, and $\beta$ is
$2 \times 1$. Consider the estimating function

$$G(\beta) = \sum_{i=1}^m x_i^T W_i^{-1} (Y_i - x_i\beta),$$

with working covariances $W_i$ of dimension $n_i \times n_i$, $i = 1, \dots, m$.

Assume throughout that $\beta_0 = 18$, $\beta_1 = 0.5$, $\alpha_1 = 1$, and $\alpha_2$ is set
to either 0.5 or 0.9. Simulate data from the multivariate normal distribution
$Y_i \sim N_{n_i}(x_i\beta, V_i)$, with the form of $V_i$ taken as either the exchangeable or
the AR(1) matrices $W_i$ that are given below, for $i = 1, \dots, m$. Examine the
efficiency of these working models as a function of:

•  The number of individuals, with m = 8, 20, 60

•  Two designs:

-  Design I: Balanced with $n_i = n = 4$, $i = 1, \dots, m$ and $x_{i1} = 8$, $x_{i2} = 10$,
$x_{i3} = 12$, $x_{i4} = 14$ for all individuals

-  Design II: Unbalanced with $n_i = n = 3$, $i = 1, \dots, m$ and observation times

   8, 10, 12    for $i = 1, \dots, m/4$
   8, 10, 14    for $i = m/4 + 1, \dots, m/2$
   8, 12, 14    for $i = m/2 + 1, \dots, 3m/4$
   10, 12, 14   for $i = 3m/4 + 1, \dots, m$

•  The working covariance structure $\alpha_1 W_i$ with:

-  Independence: $W_i = I_{n_i}$ where $I_{n_i}$ is the $n_i \times n_i$ identity matrix.

-  Exchangeable: $W_i$ has diagonal elements 1 and off-diagonal elements $\alpha_2$.

-  First-order autocorrelation: $W_i$ has diagonal elements 1 and off-diagonal
elements $W_{ijk} = \alpha_2^{|j-k|}$, $j, k = 1, \dots, n_i$, $j \ne k$, $i = 1, \dots, m$.

In total, there are 3 x 2 x 4 = 24 sets of simulations, and for each you should:

(a) Report the 95% confidence interval coverage for $\beta_1$.

(b) Report the standard errors and efficiencies. For each working covariance
model, there are two standard error calculations; the "true" standard
errors are obtained across simulations while $\widehat{\mathrm{var}}(\hat{\beta}_1)$ describes the average
(across simulations) of the reported squared standard error, where the
latter is calculated using the sandwich formula. To evaluate the efficiencies,
the (sandwich) variance of the estimators under each of the working
models should be calculated.

8.14  Crowder  and  Hand  (1990)  describe  data  on  the  body  weight  of  rats  measured 
over  64  days.  These  data  are  available  in  the  R  package  nlme  and  are  named 
BodyWeight.  Body  weight  is  measured  (in  grams)  on  day  1,  and  every  7 
days  subsequently  until  day  64,  with  an  extra  measurement  on  day  44.  There 
are  3  groups  of  rats,  each  on  a  different  diet;  8  rats  are  on  a  control  diet,  and 
two  sets  of  4  rats  are  each  on  a  different  treatment. 

(a)  Fit  LMMs  to  these  data  using  ML/REML,  with  the  primary  aim  being  to 
determine  whether  there  are  differences  in  intercepts  and  slopes  for  each 
of  the  diets.  Repeat  this  procedure  using  GEE. 

(b)  Carefully  describe  the  models  that  you  fit,  in  particular  the  choice  of 
random  effects  structure  in  the  LMM,  and  summarize  your  findings  in 
simple  terms. 

(c) Now, analyze the first group of rats using a Bayesian analysis. Specifically,
suppose $Y_{ij}$ is the body weight of rat $i$ at time $t_j$, and consider the
three-stage model:

Stage One:

$$Y_{ij} = \beta_0 + b_i + \beta_1 t_j + \epsilon_{ij}$$

with $\epsilon_{ij} \mid \tau \sim_{iid} N(0, \tau^{-1})$, $i = 1, \dots, m$, $j = 1, \dots, n$.

Stage Two: $b_i \mid \tau_0 \sim_{iid} N(0, \tau_0^{-1})$, with $b_i$ independent of the $\epsilon_{ij}$,
$i = 1, \dots, m$.

Stage Three: Independent hyperpriors with:

$$\pi(\beta) \propto 1,$$

$$\pi(\tau) \propto \tau^{-1},$$

$$\tau_0 \sim \mathrm{Ga}(0.1, 0.5),$$

where $\beta = [\beta_0, \beta_1]^T$.

(d) Find the form of the conditional distributions that are required for
constructing a Gibbs sampling algorithm to explore the posterior distribution
$p(\beta, \tau, b_1, \dots, b_m, \tau_0 \mid y)$:

$$p(\beta \mid \tau, b_1, \dots, b_m, \tau_0, y),$$

$$p(\tau \mid \beta, b_1, \dots, b_m, \tau_0, y),$$

$$p(\tau_0 \mid \beta, \tau, b_1, \dots, b_m, y),$$

$$p(b_i \mid \beta, \tau, b_j, j \neq i, \tau_0, y), \quad i = 1, \dots, m.$$

(e)  Implement  this  algorithm  for  the  data  on  the  8  rats  in  the  control 
group.  Provide  trace  plots  of  selected  parameters  to  provide  evidence 
of  convergence  of  the  Markov  chain.  Report  two  sets  of  summaries, 
consisting  of  the  5%,  50%,  95%  quantiles,  from  two  chains  started  from 
different  values. 

(f)  Check  your  answers  using  available  software,  such  as  INLA  or  WinBUGS. 


8.15  Prove  (8.58). 


Chapter  9 

General  Regression  Models 


9.1  Introduction 

In  this  chapter  we  consider  dependent  data  but  move  from  the  linear  models  of 
Chap.  8  to  general  regression  models.  As  in  Chap.  6,  we  consider  generalized  linear 
models  (GLMs)  and,  more  briefly,  nonlinear  models.  We  first  give  an  outline 
of  this  chapter.  In  Sect.  9.2  we  describe  three  motivating  datasets  to  which  we 
return  throughout  the  chapter.  The  GLMs  discussed  in  Sect.  6.3  can  be  extended 
to  incorporate  dependences  in  observations  on  the  same  unit;  as  with  the  linear 
model,  an  obvious  way  to  carry  out  modeling  in  this  case  is  to  introduce  unit- 
specific  random  effects.  Within  a  GLM  a  natural  approach  is  for  these  random 
effects  to  be  included  on  the  linear  predictor  scale.  The  resultant  conditional  models 
are  known  as  generalized  linear  mixed  models  (GLMMs),  and  these  are  introduced 
in  Sect.  9.3.  In  Sects.  9.4  and  9.5  we  describe  likelihood  and  conditional  likelihood 
methods  of  estimation,  with  Sect.  9.6  devoted  to  a  Bayesian  treatment.  Section  9.7 
illustrates  some  of  the  flexibility  of  GLMMs  by  describing  and  applying  a  particular 
model  for  spatial  dependence.  An  alternative  random  effects  specification,  based 
on  conjugacy,  is  described  in  Sect.  9.8.  An  important  approach  to  the  modeling 
and  analysis  of  dependent  data  that  is  philosophically  different  from  the  random 
effects  formulation  is  via  marginal  models  and  generalized  estimating  equations 
(GEE),  and  these  are  the  subject  of  Sect.  9.9.  In  Sect.  9.10,  a  second  GEE  approach 
is  described  in  which  the  estimating  equations  for  the  mean  are  supplemented 
with  a  second  set  for  the  variances/covariances.  For  GLMMs,  extra  care  must  be 
taken  with  parameter  interpretation,  and  Sect.  9.11  discusses  this  issue,  emphasizing 
how  interpretation  differs  between  conditional  and  marginal  models.  In  Part  II  of 
the  book,  which  focused  on  independent  data,  Chap.  7  was  devoted  to  models  for 
binary data. For dependent data, models for binary data are less well developed, and
so we do not devote a complete chapter to their description. However, Sect. 9.12
introduces the modeling of dependent binary data, and, subsequently, Sects. 9.13
and 9.14 describe conditional (mixed) and marginal models for binary data. Section
9.15 considers how nonlinear models, as defined in Sect. 6.10, can be extended
to the dependent data case. For such models, many applications concentrate on
inference for units, and so the introduction of random effects is again suggested.
We refer to the resultant class of models as nonlinear mixed models (NLMMs).
Section 9.16 considers issues related to the parameterization of the nonlinear model.
Inference  for  nonlinear  mixed  models  via  likelihood  and  Bayes  approaches  is 
covered  in  Sects.  9.17  and  9.18,  while  GEE  is  briefly  considered  in  Sect.  9.19.  The 
assessment  of  assumptions  for  general  regression  models  is  described  in  Sect.  9.20, 
with  concluding  comments  contained  in  Sect.  9.21.  Additional  references  appear  in 
Sect.  9.22. 


9.2  Motivating  Examples 

In  this  chapter  we  will  analyze  the  lung  cancer  and  radon  data  introduced  in 
Sect.  1.3.3  and  three  additional  datasets. 


9.2.1  Contraception  Data 

Fitzmaurice  et  al.  (2004)  reanalyze  data  originally  appearing  in  Machin  et  al.  (1988) 
concerning  a  randomized  longitudinal  contraception  trial.  Each  of  1,151  women 
received  injections  of  100  or  150  mg  of  depot  medroxyprogesterone  acetate  (DMPA) 
on  the  day  of  randomization  and  three  additional  injections  at  90-day  intervals. 
There  was  a  final  follow-up  3  months  after  the  last  injection  (a  year  after  the  initial 
injection).  The  women  completed  a  menstrual  diary  throughout  the  study,  and  the 
binary  response  is  whether  the  woman  had  experienced  amenorrhea,  the  absence 
of  menstrual  bleeding  for  a  specified  number  of  days,  during  each  of  the  four  3- 
month  intervals.  There  was  dropout  in  this  study,  but  we  will  not  address  this  issue, 
important  though  it  is.  The  sample  sizes,  across  measurement  occasions,  in  the  low- 
and  high-dose  groups  are  [576, 477, 409, 361]  and  [575, 476, 389,  353], respectively. 
Plotting  the  individual-level  0/1  data  is  usually  not  informative  for  binary  data,  and 
so in Fig. 9.1, we plot the averages, that is, the probabilities of amenorrhea over time
for  the  two  treatment  groups.  We  see  increasing  probabilities  of  amenorrhea  in  both 
groups,  with  the  probabilities  in  the  150-mg  dose  group  being  greater  than  in  the 
100-mg  dose  group. 

As  we  will  discuss  in  Sect.  9.14,  for  binary  data,  there  is  no  obvious  natural 
measure  of  dependence,  unlike  normal  data  for  which  the  correlation  is  routinely 
used.  However,  Table  9.1  gives  the  empirical  correlations  between  responses  at 
different  measurement  occasions  in  the  low-  and  high-dose  groups,  respectively. 
In  both  groups  there  is  appreciable  correlation  between  observations  on  the  same 
woman,  with  a  suggestion  that  the  correlations  decrease  on  measurements  taken 
further  apart.  To  explicitly  acknowledge  the  dependence  over  time  in  responses  on 
the same woman, multivariate binary data models are required.

Fig. 9.1 Probability of amenorrhea over time in low- and high-dose groups, in the contraception
data

Table 9.1 Empirical variances (on the diagonal) and correlations (on the
upper diagonal), between measurements on the same woman at different
observation occasions (1-4), in the low- (left) and high- (right) dose groups
of the contraception data

Low dose                           High dose
      1      2      3      4             1      2      3      4
1  0.15   0.40   0.28   0.27       1  0.16   0.31   0.25   0.29
2         0.19   0.45   0.35       2         0.22   0.43   0.43
3                0.24   0.13       3                0.25   0.47
4                       0.25       4                       0.25


9.2.2  Seizure  Data 


Thall  and  Vail  (1990)  describe  data  on  epileptic  seizures  in  59  individuals.  For  each 
patient,  the  number  of  epileptic  seizures  was  recorded  during  a  baseline  period  of  8 
weeks,  after  which  patients  were  randomized  to  one  of  two  groups:  treatment  with 
either  the  antiepileptic  drug  progabide  or  with  placebo.  The  numbers  of  individuals 
in the placebo and progabide groups were 28 and 31, respectively. The number of
seizures was recorded in four consecutive 2-week periods. For these data, let $Y_{ij}$
represent the seizure count for patient $i$, $i = 1, \dots, 59$, at occasion $j$, with $j = 0$
the baseline period and $j = 1, \dots, 4$ the subsequent set of four 2-week measurement
periods. Also, let $T_j$ be the length (in weeks) of the observation period (which is the
same for all individuals), with $T_0 = 8$ and $T_j = 2$ for $j = 1, \dots, 4$. We might
consider the model

$$Y_{ij} \mid \mu_{ij} \sim \mathrm{Poisson}(\mu_{ij}),$$

where $\mu_{ij} = T_j \exp(x_{ij}\beta)$ and $\exp(x_{ij}\beta)$ is a loglinear regression model for the
seizure rate. There are two immediate issues with this model: data on the same
individual are unlikely to be independent and there may be excess-Poisson variation.

Fig. 9.2 Log of seizure rates by period for individuals on placebo and progabide

As a first look at the data, we plot the log seizure rate $\log[(Y_{ij} + 0.5)/T_j]$ for each
individual versus period $j$ in Fig. 9.2. The 0.5 is added to avoid taking the log of zero.
The  line  types  distinguish  the  placebo  and  progabide  groups.  It  is  difficult  to  discern 
much  pattern  from  this  plot.  In  particular,  it  is  not  clear  if  progabide  provides  a  drop 
in  the  rate  of  seizures,  though  there  is  clearly  large  between-individual  variability  in 
the  rates.  One  individual’s  profile  appears  to  be  outlying  and  high,  with  the  rate  of 
seizures  increasing  after  treatment  with  progabide. 

Figure 9.3 displays the average seizure counts by period and by treatment group.
In three out of the four post-baseline periods, the averages are lower in the progabide
group. To assess the excess-Poisson variation, we calculate the ratio of the variance of the
counts to the mean, that is, $\mathrm{var}(Y_{ij})/\mathrm{E}[Y_{ij}]$, by period and treatment group. Table 9.2
gives these ratios and clearly shows that there is a great deal of excess-Poisson
variability for these data.
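
The ratio just described is simple to compute. The following Python sketch is an illustration only: the function name `variance_to_mean_ratio` is ours, and the counts are simulated from a Poisson model with a shared lognormal subject effect, not taken from the Thall and Vail data. Under a Poisson model the ratio would be close to 1 in every period.

```python
import numpy as np

def variance_to_mean_ratio(counts):
    """Sample variance divided by sample mean for each column (period)."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(axis=0, ddof=1) / counts.mean(axis=0)

# Synthetic counts for illustration only; these are NOT the Thall and Vail data.
rng = np.random.default_rng(1)
period_weeks = np.array([8.0, 2.0, 2.0, 2.0, 2.0])     # T_0 = 8, T_j = 2
subject_effect = rng.normal(0.0, 0.8, size=(200, 1))   # shared within a subject
counts = rng.poisson(period_weeks * np.exp(1.0 + subject_effect))

ratios = variance_to_mean_ratio(counts)   # one ratio per period, j = 0,...,4
```

Applying the function separately to the rows of each treatment group would give a table with the same layout as Table 9.2.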


Fig. 9.3 Average seizure rates by period and treatment group

Table 9.2 Ratio of the variance of seizure counts to the mean of the seizure
counts, by period and treatment group

                     Period
Group          0      1      2      3      4
Placebo     22.1   11.0    8.0   24.5    7.3
Progabide   24.8   38.8   16.7   23.7   18.9


9.2.3 Pharmacokinetics of Theophylline

Twelve subjects were given an oral dose of the antiasthmatic agent theophylline,
with 11 concentration measurements obtained from each individual over 25 h. The
doses ranged between 3.10 and 5.86 mg/kg. As is usual with experiments such as
this, there is abundant sampling at early times in an attempt to capture the absorption
phase, which is rapid. Further background on pharmacokinetic modeling is given in
Example 1.3.4 where the data for the first individual were presented. Section 6.2
introduced a mean model for these data (for a generic individual) as

$$\frac{D k_a}{V(k_a - k_e)}\left[\exp(-k_e x) - \exp(-k_a x)\right],$$

where $x$ is the sampling time, $k_a > 0$ is the absorption rate constant, $k_e > 0$ is
the elimination rate constant, and $V > 0$ is the (apparent) volume of distribution
(that converts total amount of drug into concentration). Figure 9.4 shows the
concentration-time data. The curves follow a similar pattern, but there is clearly
between-subject variability.

Fig. 9.4 Concentrations versus time for 12 individuals who received the drug theophylline
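
To see the shape implied by the mean model, the following Python sketch evaluates the curve on a grid of times. It assumes that $D$ denotes the dose, and the parameter values are hypothetical, chosen only to give a plausible curve; they are not estimates from the theophylline data. The function names are ours.

```python
import numpy as np

def one_compartment_mean(x, dose, V, ka, ke):
    """Mean concentration at time x under the one-compartment model with
    first-order absorption (rate ka) and elimination (rate ke)."""
    if ka == ke:
        raise ValueError("this form of the model requires ka != ke")
    x = np.asarray(x, dtype=float)
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * x) - np.exp(-ka * x))

def time_of_peak(ka, ke):
    """Time at which the mean curve is maximized (derivative set to zero)."""
    return np.log(ka / ke) / (ka - ke)

# Hypothetical values for illustration only.
grid = np.linspace(0.0, 25.0, 2501)   # hours since drug administration
conc = one_compartment_mean(grid, dose=4.0, V=0.5, ka=1.5, ke=0.08)
peak_time = time_of_peak(ka=1.5, ke=0.08)
```

With $k_a$ much larger than $k_e$, the curve rises quickly and then declines slowly, which is why sampling is concentrated at early times.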


9.3  Generalized  Linear  Mixed  Models 

In  this  section  we  describe  a  modeling  framework  that  allows  the  introduction  of 
random  effects  into  GLMs;  these  models  induce  dependence  between  responses  on 
the  same  unit.  Adding  normal  random  effects  on  the  linear  predictor  scale  gives  a 
GLMM. The paper of Breslow and Clayton (1993) popularized these models, by
discussing implementation and providing a number of case studies.

We first describe notation. Let $Y_{ij}$ be the $j$th observation on the $i$th unit for
$i = 1, \dots, m$, $j = 1, \dots, n_i$. The responses for the $i$th unit will be denoted
$Y_i = [Y_{i1}, \dots, Y_{in_i}]^T$, $i = 1, \dots, m$. Responses on different units will be assumed
independent. Let $\beta$ represent a $(k+1) \times 1$ vector of fixed effects and $b_i$ a $(q+1) \times 1$
vector of random effects, with $q \le k$. Let $x_{ij} = [1, x_{ij1}, \dots, x_{ijk}]^T$ be a $(k+1) \times 1$
vector of covariates, so that $x_i = [x_{i1}, \dots, x_{in_i}]^T$ is the design matrix for the fixed
effects of unit $i$, and let $z_{ij} = [1, z_{ij1}, \dots, z_{ijq}]^T$ be a $(q+1) \times 1$ vector of variables
that are a subset of $x_{ij}$, so that $z_i = [z_{i1}, \dots, z_{in_i}]^T$ is the design matrix for the
random effects of unit $i$.

A GLMM is defined by the following two-stage model:

Stage One: The distribution of the data is $Y_{ij} \mid \theta_{ij}, \alpha \sim p(\cdot)$ where $p(\cdot)$ is a member
of the exponential family, that is

$$p(y_{ij} \mid \theta_{ij}, \alpha) = \exp\left\{ [y_{ij}\theta_{ij} - b(\theta_{ij})]/a(\alpha) + c(y_{ij}, \alpha) \right\}, \qquad (9.1)$$

for $i = 1, \dots, m$ units and $j = 1, \dots, n_i$ measurements per unit. The variance is

$$\mathrm{var}(Y_{ij} \mid \theta_{ij}, \alpha) = \alpha v(\mu_{ij}).$$

Let $\mu_{ij} = \mathrm{E}[Y_{ij} \mid \theta_{ij}, \alpha]$ and, for a link function $g(\cdot)$, suppose

$$g(\mu_{ij}) = x_{ij}\beta + z_{ij}b_i,$$

so that random effects are introduced on the scale of the linear predictor. This defines
the conditional part of the model.

Stage Two: The random effects are assigned a normal distribution:

$$b_i \mid D \sim_{iid} N_{q+1}(0, D).$$

For  a  number  of  reasons,  including  parameter  interpretation,  it  is  important  to 
investigate  the  marginal  moments  that  are  induced  by  the  random  effects.  Since 
marginal  summaries  may  be  calculated  for  the  observed  data,  comparison  with  the 
theoretical  forms  is  useful  for  model  checking.  The  marginal  mean  is 

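$$\begin{aligned}
\mathrm{E}[Y_{ij}] &= \mathrm{E}_{b_i}[\mathrm{E}(Y_{ij} \mid b_i)] \\
&= \mathrm{E}_{b_i}[\mu_{ij}] = \mathrm{E}_{b_i}[g^{-1}(x_{ij}\beta + z_{ij}b_i)].
\end{aligned}$$

The variance is

$$\begin{aligned}
\mathrm{var}(Y_{ij}) &= \mathrm{E}[\mathrm{var}(Y_{ij} \mid b_i)] + \mathrm{var}(\mathrm{E}[Y_{ij} \mid b_i]) \\
&= \alpha \mathrm{E}_{b_i}[v\{g^{-1}(x_{ij}\beta + z_{ij}b_i)\}] + \mathrm{var}_{b_i}[g^{-1}(x_{ij}\beta + z_{ij}b_i)].
\end{aligned}$$

The covariances between outcomes on the same unit are

$$\begin{aligned}
\mathrm{cov}(Y_{ij}, Y_{ik}) &= \mathrm{E}[\mathrm{cov}(Y_{ij}, Y_{ik} \mid b_i)] + \mathrm{cov}(\mathrm{E}[Y_{ij} \mid b_i], \mathrm{E}[Y_{ik} \mid b_i]) \\
&= \mathrm{cov}_{b_i}[g^{-1}(x_{ij}\beta + z_{ij}b_i), g^{-1}(x_{ik}\beta + z_{ik}b_i)]
\end{aligned}$$

for $j \neq k$ due to shared random effects, and

$$\mathrm{cov}(Y_{ij}, Y_{i'k}) = 0,$$

for $i \neq i'$, as there are no random effects in the model that are shared by different
units. Explicit forms of the moments are available for some choices of exponential
family, as we see later in the chapter, though the marginal distribution of the data is
not typically available (outside of the normal case discussed in Chap. 8).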
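
When no explicit form is available, the marginal mean can be evaluated numerically. The following Python sketch is ours and covers only the special case of a single random intercept, $b_i \sim N(0, \sigma_0^2)$; it approximates $\mathrm{E}_{b_i}[g^{-1}(\eta + b_i)]$ by Gauss-Hermite quadrature, where `eta` stands for the fixed part $x_{ij}\beta$. The function names and the numerical values are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def marginal_mean(eta, sigma0, inv_link, n_nodes=40):
    """Approximate E[inv_link(eta + b)] for b ~ N(0, sigma0^2).

    hermgauss gives nodes and weights for the weight function exp(-x^2),
    so we substitute b = sqrt(2) * sigma0 * x and divide by sqrt(pi).
    """
    nodes, weights = hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma0 * nodes
    return np.sum(weights * inv_link(eta + b)) / np.sqrt(np.pi)

def expit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-x))

# Log link: an explicit form exists, exp(eta + sigma0^2 / 2).
log_link_mean = marginal_mean(1.03, 0.78, np.exp)
# Logit link: no explicit form; the marginal mean is pulled toward 0.5.
logit_link_mean = marginal_mean(1.0, 2.0, expit)
```

For the logit link the marginal mean lies between 0.5 and the conditional mean at $b_i = 0$, which gives a first indication of why conditional and marginal parameters differ in interpretation.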


9.4  Likelihood  Inference  for  Generalized  Linear 
Mixed  Models 

As discussed in Sect. 8.5, there are three distinct sets of parameters for which
inference may be required: fixed effects $\beta$, variance components $\boldsymbol{\alpha}$, and random
effects $b = [b_1, \dots, b_m]^T$. As with the linear mixed model (LMM), we maximize
the likelihood $L(\beta, \boldsymbol{\alpha})$, where $\boldsymbol{\alpha}$ denotes the variance components in $D$ and the scale
parameter $\alpha$ (if present). The likelihood is obtained by integrating $[b_1, \dots, b_m]$ from
the model:

$$L(\beta, \boldsymbol{\alpha}) = \prod_{i=1}^{m} \int p(y_i \mid b_i, \beta, \boldsymbol{\alpha})\, p(b_i \mid \boldsymbol{\alpha})\, db_i.$$

There are $m$ integrals to evaluate, each of dimension equal to the number of random
effects, $q + 1$. For non-Gaussian GLMMs, these integrals are not available in closed
form, and so some sort of analytical, numerical, or simulation-based approximation
is required (Sect. 3.7). Common approaches include analytic approximations such
as the Laplace approximation (Sect. 3.7.2) or the use of adaptive Gauss-Hermite
numerical integration rules (Sect. 3.7.3). There are two difficulties with inference
for GLMMs: carrying out the required integrations and maximizing the resultant
(approximated) likelihood function. The likelihood function can be unwieldy, and,
in particular, the second derivatives may be difficult to determine, so the
Newton-Raphson method cannot be directly used. An alternative is provided by the
quasi-Newton approach in which the derivatives are approximated (Dennis and Schnabel
1996).

One  approach  to  the  integration/maximization  difficulties  is  the  following.  In 
Sect.  6.5.2  the  iteratively  reweighted  least  squares  (IRLS)  algorithm  was  described 
as  a  method  for  finding  MLEs  in  a  GLM.  The  penalized-IRLS  (P-IRLS)  algorithm 
is  a  variant  in  which  the  working  likelihood  is  augmented  with  a  penalization  term 
corresponding  to  the  (log  of  the)  random  effects  distribution.  This  algorithm  may 
be used in a GLMM context in order to obtain, conditional on $\boldsymbol{\alpha}$, estimates of $\beta$
and $b$, with $\boldsymbol{\alpha}$ being estimated via a profile log-likelihood (Sect. 2.4.2); see Bates
(2011). The P-IRLS is also used for nonparametric regression and is described in
this context in Sect. 11.5.1.

The  method  of  penalized  quasi-likelihood  (PQL)  was  historically  popular 
(Breslow  and  Clayton  1993)  but  can  be  unacceptably  inaccurate,  in  particular,  for 
binary outcomes. See Breslow (2005) for a recent perspective.

Approximate inference for $[\beta, \boldsymbol{\alpha}]$ is carried out via the usual asymptotic normality
of the MLE which is, with sloppy notation,

$$\begin{bmatrix} \hat{\beta} \\ \hat{\boldsymbol{\alpha}} \end{bmatrix} \sim N\left( \begin{bmatrix} \beta \\ \boldsymbol{\alpha} \end{bmatrix}, \begin{bmatrix} I_{\beta\beta} & I_{\beta\alpha} \\ I_{\alpha\beta} & I_{\alpha\alpha} \end{bmatrix}^{-1} \right), \qquad (9.2)$$

where $I_{\beta\beta}$, $I_{\beta\alpha}$, $I_{\alpha\beta}$, and $I_{\alpha\alpha}$ are the relevant information matrices. An important
observation is that in general $I_{\beta\alpha} \neq 0$, and so we cannot separately estimate the
regression and variance parameters; consistency therefore requires correct specification of
both mean and variance models. Likelihood ratio tests are available for fixed effects
though it requires experience or simulation to determine whether the sample size $m$
is large enough for the null $\chi^2$ distribution to be accurate.

In terms of the random effects, one estimator is

$$\mathrm{E}[b_i \mid y] = \frac{\int b_i\, p(y \mid b_i)\, p(b_i \mid D)\, db_i}{\int p(y \mid b_i)\, p(b_i \mid D)\, db_i}.$$

Unless the first stage is normal, the integrals in numerator and denominator will
not be analytically tractable, though Laplace approximations or adaptive
Gauss-Hermite may be used. In practice, empirical Bayes estimators, $\mathrm{E}[b_i \mid y, \hat{\beta}, \hat{\boldsymbol{\alpha}}]$, are
used.


Example:  Seizure  Data 

Recall that $Y_{ij}$ is the number of seizures for patient $i$ during period $j$, $j = 0, 1, 2, 3, 4$,
and $T_j$ is the length of the observation period $j$, $j = 0, 1, 2, 3, 4$, with $T_0 = 8$
weeks and $T_j = 2$ weeks for $j = 1, \dots, 4$. It is clear from Fig. 9.2 that there
is considerable between-patient variability in the level of seizures, which suggests
that a random effects model should include at least random intercepts. A random
intercepts GLMM for the seizure data is:

Stage One: $Y_{ij} \mid \beta, b_i \sim_{ind} \mathrm{Poisson}(\mu_{ij})$, with

$$g(\mu_{ij}) = \log \mu_{ij} = \log T_j + x_{ij}\beta + b_i,$$

and where $x_{ij}$ is the row of the design matrix for individual $i$ at period $j$, with associated fixed
effect $\beta$. A particular model will be discussed shortly. The first two conditional
moments are

$$\mathrm{E}[Y_{ij} \mid b_i] = \mu_{ij} = T_j \exp(x_{ij}\beta + b_i),$$

$$\mathrm{var}(Y_{ij} \mid b_i) = \mu_{ij}.$$

Stage Two: $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$.

Writing $\alpha = \sigma_0^2$, the likelihood is

$$\begin{aligned}
L(\beta, \alpha) &= \prod_{i=1}^{m} \int \prod_{j=1}^{n_i} \frac{\exp[-\mu_{ij}(b_i)]\, \mu_{ij}(b_i)^{y_{ij}}}{y_{ij}!} \times (2\pi\sigma_0^2)^{-1/2} \exp\left( -\frac{b_i^2}{2\sigma_0^2} \right) db_i \\
&= (2\pi\sigma_0^2)^{-m/2} \prod_{i=1}^{m} \left[ \prod_{j=1}^{n_i} \frac{1}{y_{ij}!} \right] \exp\left( \sum_{j=1}^{n_i} y_{ij} (\log T_j + x_{ij}\beta) \right) \\
&\qquad \times \int \exp\left( -e^{b_i} \sum_{j=1}^{n_i} T_j e^{x_{ij}\beta} + b_i \sum_{j=1}^{n_i} y_{ij} - \frac{b_i^2}{2\sigma_0^2} \right) db_i.
\end{aligned}$$

The latter integral is analytically intractable. A Laplace approximation would
expand each of the $m$ integrands about the maximizing value of $b_i$, or, alternatively,
numerical integration can be used, for example, using adaptive Gauss-Hermite.

Let $x_{1i} = 0/1$ if patient $i$ was assigned placebo/progabide, $x_{2j} = 0/1$ if
$j = 0/1, 2, 3, 4$, and $x_{3ij} = x_{1i} \times x_{2j}$ for $j = 0, 1, 2, 3, 4$. Therefore, $x_1$ is a
treatment indicator, $x_2$ is an indicator of pre-/post-baseline, and $x_3$ takes the value
1 for progabide individuals who are post-baseline and is zero otherwise. The first
model we fit is

$$x_{ij}\beta = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2j} + \beta_3 x_{3ij}, \qquad (9.3)$$

so that $x_{ij}$ is $1 \times 4$ and $\beta$ is $4 \times 1$. Table 9.3 summarizes the form of the model
across groups and periods.

Table 9.3 Parameter interpretation for the model defined by (9.3)

Group        Period 0                    Period 1, 2, 3, 4
Placebo      $\exp(\beta_0)$             $\exp(\beta_0 + \beta_2)$
Progabide    $\exp(\beta_0 + \beta_1)$   $\exp(\beta_0 + \beta_1 + \beta_2 + \beta_3)$
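
The two approximations to the intractable integral can be compared numerically for a single individual. The following Python sketch is an illustration under stated assumptions, not the implementation used to produce the results in this chapter. It works with the part of the integrand that depends on $b_i$, namely $\exp(-A e^{b_i} + y_{i+} b_i - b_i^2/(2\sigma_0^2))$, where $A = \sum_j T_j \exp(x_{ij}\beta)$ and $y_{i+} = \sum_j y_{ij}$. The values of `A`, `y_sum`, and `sigma0` are hypothetical, and the function names are ours.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def log_h(b, A, y_sum, sigma0):
    """Log of the b-dependent part of one individual's integrand."""
    return -A * np.exp(b) + y_sum * b - b**2 / (2.0 * sigma0**2)

def mode_and_scale(A, y_sum, sigma0, n_iter=50):
    """Newton's method for the maximizer of log_h, plus the scale
    (inverse square root of minus the second derivative at the mode)."""
    b = 0.0
    for _ in range(n_iter):
        grad = -A * np.exp(b) + y_sum - b / sigma0**2
        hess = -A * np.exp(b) - 1.0 / sigma0**2      # always negative
        b -= grad / hess
    scale = 1.0 / np.sqrt(A * np.exp(b) + 1.0 / sigma0**2)
    return b, scale

def log_integral_laplace(A, y_sum, sigma0):
    """Laplace approximation to the log of the integral of exp(log_h)."""
    b_hat, scale = mode_and_scale(A, y_sum, sigma0)
    return log_h(b_hat, A, y_sum, sigma0) + 0.5 * np.log(2.0 * np.pi) + np.log(scale)

def log_integral_agh(A, y_sum, sigma0, n_nodes=30):
    """Adaptive Gauss-Hermite: nodes are centered at the mode and scaled."""
    b_hat, scale = mode_and_scale(A, y_sum, sigma0)
    x, w = hermgauss(n_nodes)                        # rule for weight exp(-x^2)
    b = b_hat + np.sqrt(2.0) * scale * x
    log_terms = np.log(w) + x**2 + log_h(b, A, y_sum, sigma0)
    shift = log_terms.max()                          # guard against underflow
    return np.log(np.sqrt(2.0) * scale) + shift + np.log(np.sum(np.exp(log_terms - shift)))

# Hypothetical inputs for a single individual.
A, y_sum, sigma0 = 40.0, 55.0, 0.78
laplace_value = log_integral_laplace(A, y_sum, sigma0)
agh_value = log_integral_agh(A, y_sum, sigma0)
```

Summing such contributions over individuals, together with the factors that do not involve $b_i$, gives an approximate log-likelihood that could then be maximized over $\beta$ and $\sigma_0$.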

We first provide an interpretation from a conditional perspective. In the following
interpretation, a "typical" patient corresponds to a patient whose random effect is
zero, that is, $b = 0$. On the more interpretable rate scale:

•  $\exp(\beta_0)$ is the rate of seizures for a typical individual under placebo in time
period 0.

•  $\exp(\beta_1)$ is the ratio of the seizure rate of a typical individual under progabide to
a typical individual under placebo, in time period 0. If the groups are comparable
at the time of treatment assignment and there are no other corrupting factors, we
would expect this parameter to be estimated as close to 1.

•  $\exp(\beta_2)$ is the ratio of the seizure rate post-baseline ($T_j$, $j = 1, 2, 3, 4$) as
compared to baseline ($T_0$), for a typical individual in the placebo group.

•  $\exp(\beta_3)$ is the ratio of the seizure rate for a typical individual in the progabide
group post-baseline, as compared to a typical individual in the placebo group in
the same period. Hence, $\exp(\beta_3)$ is the rate ratio parameter of interest.

Alternatively, we may interpret these rates and ratios of rates as being between two
individuals with the same baseline rate of seizures (i.e., the same random effect $b$)
prior to treatment assignment.

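We now evaluate the implied marginal model. We recap the first two moments of
a lognormal random variable. If $Z \sim \mathrm{LogNorm}(\mu, \sigma^2)$, then

$$\begin{aligned}
\mathrm{E}[Z] &= \exp(\mu + \sigma^2/2) \\
\mathrm{var}(Z) &= \exp(2\mu + \sigma^2) \times [\exp(\sigma^2) - 1] \\
&= \mathrm{E}[Z]^2 \times [\exp(\sigma^2) - 1].
\end{aligned}$$

Therefore, since $\exp(b_i) \sim \mathrm{LogNorm}(0, \sigma_0^2)$, the marginal mean is

$$\begin{aligned}
\mathrm{E}[Y_{ij}] &= \mathrm{E}_{b_i}[\mathrm{E}(Y_{ij} \mid b_i)] \\
&= T_{ij} \exp(x_{ij}\beta)\, \mathrm{E}_{b_i}[\exp(b_i)] \\
&= T_{ij} \exp(x_{ij}\beta + \sigma_0^2/2).
\end{aligned}$$

Consequently, for this model, relative rates $\exp(\beta_k)$, $k = 1, 2, 3$ (which, recall,
are ratios) have a marginal interpretation, since the $\exp(\sigma_0^2/2)$ terms cancel in
numerator and denominator (under the model). For example, $\exp(\beta_1)$ is the ratio
of the average seizure rate in the progabide group to the average rate in the placebo
group, in time period 0. Further discussion of parameter interpretation in marginal
and conditional models is provided in Sect. 9.11. The marginal variance is

$$\begin{aligned}
\mathrm{var}(Y_{ij}) &= \mathrm{E}_{b_i}[\mathrm{var}(Y_{ij} \mid b_i)] + \mathrm{var}_{b_i}[\mathrm{E}(Y_{ij} \mid b_i)] \\
&= \mathrm{E}_{b_i}[T_{ij} \exp(x_{ij}\beta + b_i)] + \mathrm{var}_{b_i}[T_{ij} \exp(x_{ij}\beta + b_i)] \\
&= \mathrm{E}[Y_{ij}]\{1 + \mathrm{E}[Y_{ij}][\exp(\sigma_0^2) - 1]\} \\
&= \mathrm{E}[Y_{ij}](1 + \mathrm{E}[Y_{ij}] \times \kappa)
\end{aligned}$$

where

$$\kappa = \exp(\sigma_0^2) - 1 > 0$$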

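illustrating quadratic excess-Poisson variation which increases as $\sigma_0^2$ increases.

The marginal covariance between observations on the same individual is

$$\begin{aligned}
\mathrm{cov}(Y_{ij}, Y_{ik}) &= \mathrm{cov}[T_{ij} \exp(x_{ij}\beta + b_i), T_{ik} \exp(x_{ik}\beta + b_i)] \\
&= T_{ij} T_{ik} \exp(x_{ij}\beta + x_{ik}\beta) \times \exp(\sigma_0^2)[\exp(\sigma_0^2) - 1] \\
&= \mathrm{E}[Y_{ij}]\, \mathrm{E}[Y_{ik}]\, \kappa.
\end{aligned}$$

To summarize, for individual $i$, and writing $\mu_{ij} = \mathrm{E}[Y_{ij}]$ for the marginal mean, the
variance-covariance matrix is

$$\begin{bmatrix}
\mu_{i1} + \mu_{i1}^2\kappa & \mu_{i1}\mu_{i2}\kappa & \mu_{i1}\mu_{i3}\kappa & \mu_{i1}\mu_{i4}\kappa \\
\mu_{i2}\mu_{i1}\kappa & \mu_{i2} + \mu_{i2}^2\kappa & \mu_{i2}\mu_{i3}\kappa & \mu_{i2}\mu_{i4}\kappa \\
\mu_{i3}\mu_{i1}\kappa & \mu_{i3}\mu_{i2}\kappa & \mu_{i3} + \mu_{i3}^2\kappa & \mu_{i3}\mu_{i4}\kappa \\
\mu_{i4}\mu_{i1}\kappa & \mu_{i4}\mu_{i2}\kappa & \mu_{i4}\mu_{i3}\kappa & \mu_{i4} + \mu_{i4}^2\kappa
\end{bmatrix}.$$

For a random intercepts only LMM the marginal correlation is constant within a unit
with correlation $\sigma_0^2/(\sigma_0^2 + \sigma_\epsilon^2)$, regardless of $\mu_{ij}$, $\mu_{ik}$. In contrast, for the Poisson
random intercepts mixed model, the marginal correlation is

$$\mathrm{corr}(Y_{ij}, Y_{ik}) = \frac{\kappa\sqrt{\mu_{ij}\mu_{ik}}}{\sqrt{(1 + \kappa\mu_{ij})(1 + \kappa\mu_{ik})}} = \left[ 1 + \frac{1}{\kappa\mu_{ij}} + \frac{1}{\kappa\mu_{ik}} + \frac{1}{\kappa^2\mu_{ij}\mu_{ik}} \right]^{-1/2}, \qquad (9.4)$$

so that correlations vary in a complicated fashion as a function of the mean
responses. However, the correlations increase as $\kappa$ increases and as the means
increase. A deficiency of this model is that we have only a single parameter ($\sigma_0^2$)
to control both excess-Poisson variability and the strength of dependence over time.
We address this in Sect. 9.6 by adding a second random effect to the model. For
observations on different individuals, $\mathrm{cov}(Y_{ij}, Y_{i'k}) = 0$ for $i \neq i'$.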
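
The marginal moments derived above can be checked by simulation. The following Python sketch is ours: it simulates from the Poisson random intercepts model and compares sample moments with the theoretical forms. The argument `mu0` holds the values $T_{ij}\exp(x_{ij}\beta)$, that is, the conditional means when $b_i = 0$; the numerical values are hypothetical.

```python
import numpy as np

def marginal_moments(mu0, sigma0):
    """Marginal mean vector and covariance matrix for one individual.

    mu0[j] = T_ij * exp(x_ij beta), the conditional mean when b_i = 0.
    """
    mu0 = np.asarray(mu0, dtype=float)
    kappa = np.exp(sigma0**2) - 1.0
    mean = mu0 * np.exp(sigma0**2 / 2.0)
    # Off-diagonal: mean_j * mean_k * kappa; diagonal adds the Poisson part.
    cov = kappa * np.outer(mean, mean) + np.diag(mean)
    return mean, cov

# Simulate many individuals sharing the same (hypothetical) mu0.
rng = np.random.default_rng(2)
mu0, sigma0 = np.array([8.0, 2.0, 2.0]), 0.5
b = rng.normal(0.0, sigma0, size=(400_000, 1))     # random intercepts
y = rng.poisson(mu0 * np.exp(b))                   # counts given b

theory_mean, theory_cov = marginal_moments(mu0, sigma0)
sample_mean = y.mean(axis=0)
sample_cov = np.cov(y, rowvar=False)
```

The sample and theoretical quantities should agree up to Monte Carlo error, with the agreement for the covariances being slower to appear because of the heavy right tail of the lognormal distribution.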

Using a Laplace approximation to evaluate the integrals that define the likelihood,
we obtain the estimates and standard errors given in Table 9.4. An alternative
approach using Gauss-Hermite with 50 points to evaluate the integrals gave the
same answers, so we conclude that the Laplace approximation is accurate in this
example.

Table 9.4 MLEs and standard errors from a generalized linear mixed
model fit to the seizure data

Parameter     Estimate   Std. err.
$\beta_0$       1.03       0.15
$\beta_1$      -0.024      0.21
$\beta_2$       0.11       0.047
$\beta_3$      -0.10       0.065
$\sigma_0$      0.78       -

Parameter meaning: $\beta_0$ is the log baseline seizure rate in
the placebo group for a typical individual; $\beta_1$ is the log
of the ratio of seizure rates between typical individuals in
the progabide and placebo groups, at baseline; $\beta_2$ is the
log of the ratio of seizure rates of typical individuals in the
post-baseline and baseline placebo groups; $\beta_3$ is the log of
the ratio of the seizure rate for a typical individual in the
progabide group as compared to a typical individual in the
placebo group, post-baseline; $\sigma_0$ is the standard deviation
of the random intercepts

In terms of the parameter of interest $\beta_3$, there is an estimated drop in the
seizure rates of 10% in the progabide group as compared to placebo, but this drop
is not significant when assessed using a likelihood ratio test under conventional
significance levels. The estimated value of $\beta_1$ indicates that the placebo and
progabide groups are comparable at baseline, though the value of $\beta_2$ and its standard
error suggest there is some evidence that the rate of seizures increased in the placebo
group after randomization.

The random intercepts standard deviation is estimated as $\hat{\sigma}_0 = 0.78$, so that
$\hat{\sigma}_0^2 = 0.61$, to give $\hat{\kappa} = 0.84$. For an individual whose rate of seizures is
constant over the study period at levels $\mu_{ij} = \mu_{ik} = 1, 2, 5$, the correlations
between responses on this individual, from (9.4), are estimated as 0.46, 0.63, 0.81.
We conclude that the correlations are appreciable.
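
The reported correlations follow directly from (9.4). As a check, the following Python sketch (the function name is ours) evaluates (9.4) at equal means of 1, 2, and 5, assuming an estimated random intercepts variance of 0.61, which corresponds to $\hat{\kappa} = \exp(0.61) - 1 \approx 0.84$.

```python
import numpy as np

def poisson_ri_correlation(mu_j, mu_k, sigma0_sq):
    """Marginal correlation (9.4) for the Poisson random intercepts model."""
    kappa = np.exp(sigma0_sq) - 1.0
    return kappa * np.sqrt(mu_j * mu_k) / np.sqrt((1.0 + kappa * mu_j) * (1.0 + kappa * mu_k))

# Equal means at the three levels considered in the text.
correlations = [round(float(poisson_ri_correlation(m, m, 0.61)), 2) for m in (1.0, 2.0, 5.0)]
# correlations is [0.46, 0.63, 0.81]
```

When the two means are equal to $\mu$, (9.4) reduces to $\kappa\mu/(1 + \kappa\mu)$, which makes the increase with both $\kappa$ and $\mu$ explicit.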


9.5  Conditional  Likelihood  Inference  for  Generalized 
Linear  Mixed  Models 

An  alternative  approach  to  estimation  in  the  GLMM  is  provided  by  conditional 
likelihood (Sect. 2.4.2). The basic idea is to split the data into components $t_1$
and $t_2$ in such a way that $t_1$ contains information on the parameters of interest,
while $t_2$ contains information primarily on nuisance parameters. In a GLMM
setting,  the  aim  is  to  condition  on  a  part  of  the  data  that  eliminates  the  random 
effects,  hence  avoiding  both  the  need  for  their  estimation  and  the  need  to  specify 
their  distribution.  A  consequence  of  the  conditioning  is  that  we  also  eliminate  all 
regression  coefficients  in  the  model  that  are  associated  with  covariates  that  are 
constant  within  an  individual. 

We now work through the details and assume a discrete GLMM and a canonical
link function so that

$$g(\mu_{ij}) = \theta_{ij} = \beta^T x_{ij} + b_i^T z_{ij}.$$

We further assume $\alpha = 1$, as is true for Poisson and binomial models. Viewing both
$\beta$ and $b$ as fixed effects gives, from (9.1),

$$\Pr(y \mid \beta, b) \propto \exp\left( \beta^T \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij} y_{ij} + \sum_{i=1}^{m} b_i^T \sum_{j=1}^{n_i} z_{ij} y_{ij} - \sum_{i=1}^{m} \sum_{j=1}^{n_i} b(\theta_{ij}) \right), \qquad (9.5)$$

where $b'(\theta_{ij}) = \mathrm{E}[Y_{ij} \mid b_i]$. Define

$$t_{1i} = \sum_{j=1}^{n_i} x_{ij} y_{ij}, \qquad t_{2i} = \sum_{j=1}^{n_i} z_{ij} y_{ij},$$

and let $t_1 = [t_{11}, \dots, t_{1m}]^T$ and $t_2 = [t_{21}, \dots, t_{2m}]^T$ so that $t_1$ and $t_{2i}$ are sufficient
statistics for $\beta$ and $b_i$, respectively. Conditioning on the sufficient statistics for $b_i$,
we obtain

$$\begin{aligned}
\Pr(y_i \mid t_{2i}, \beta) &= \frac{\Pr\left( y_i, \sum_{j=1}^{n_i} z_{ij} Y_{ij} = t_{2i} \mid \beta, b_i \right)}{\Pr\left( \sum_{j=1}^{n_i} z_{ij} Y_{ij} = t_{2i} \mid \beta, b_i \right)} \\
&= \frac{\sum_{S_{1i}} \exp(\beta^T t_{1i} + b_i^T t_{2i})}{\sum_{S_{2i}} \exp\left( \beta^T \sum_{j=1}^{n_i} x_{ij} y_{ij} + b_i^T t_{2i} \right)},
\end{aligned}$$

so that the conditional likelihood is

$$L_c(\beta) = \prod_{i=1}^{m} \frac{\sum_{S_{1i}} \exp(\beta^T t_{1i})}{\sum_{S_{2i}} \exp\left( \beta^T \sum_{j=1}^{n_i} x_{ij} y_{ij} \right)},$$

where

$$S_{1i} = \left\{ y_i \;\Big|\; \sum_{j=1}^{n_i} x_{ij} y_{ij} = t_{1i},\ \sum_{j=1}^{n_i} z_{ij} y_{ij} = t_{2i} \right\},$$

$$S_{2i} = \left\{ y_i \;\Big|\; \sum_{j=1}^{n_i} z_{ij} y_{ij} = t_{2i} \right\}.$$

The set $S_{1i}$ denotes the possible outcomes $y_{ij}$, $j = 1, \dots, n_i$, that are consistent with
$t_{1i}$ and $t_{2i}$, given $x_i$ and $z_i$. The conditional MLE has the usual properties of an
MLE. In particular, under regularity conditions, it is consistent and asymptotically
normally distributed with the variance-covariance matrix determined from the
second derivatives of the conditional log-likelihood.

The conditional likelihood approach allows the specification of a model (via the
parameters $b_i$) to acknowledge dependence but eliminates these parameters from
the model. We emphasize that no distribution has been specified for the $b_i$, as
they have been viewed as fixed effects. Depending on the structure of $x_{ij}$ and
$z_{ij}$, some of the $\beta$ parameters may be eliminated from the model. For example,
if $x_{ij} = z_{ij}$, the collections $S_{1i}$ and $S_{2i}$ coincide and the complete $\beta$ vector would
be conditioned away.

Example: Seizure Data

We derive the conditional likelihood in this example, using the random intercepts
only model so that $z_{ij}^T b_i = b_i$. The loglinear random intercept model is

$$
\log E[Y_{ij} \mid \beta^*, \lambda_i] = \log T_{ij} + \lambda_i + \beta_2 x_{ij2} + \beta_3 x_{ij3}
= \log T_{ij} + \lambda_i + x_{ij}\beta^*
= \log \mu_{ij},
$$

where $\beta^* = [\beta_2, \beta_3]^T$ represents the regression coefficients that are not conditioned
from the model (since they are associated with covariates that change within an
individual), $x_{ij} = [x_{ij2}, x_{ij3}]$, and $\lambda_i = \beta_0 + \beta_1 x_{i1} + b_i$. We cannot estimate $\beta_1$
because the associated covariate $x_{i1}$ is a treatment indicator and constant within an
individual in this study; hence, it is eliminated from the model by the conditioning,
along with $b_i$ and $\beta_0$. The parameter $\beta_1$ is not of primary interest, however.

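To derive the conditional likelihood, we first write $c_{1i}^{-1} = \prod_{j=0}^{4} y_{ij}!$, and then
the joint distribution of the data for the $i$th individual is

$$
p(y_i \mid \beta^*, \lambda_i) = c_{1i} \exp\Big( \sum_{j=0}^{4} y_{ij} \log \mu_{ij} - \sum_{j=0}^{4} \mu_{ij} \Big)
= c_{1i} \exp\Big( \lambda_i y_{i+} + \sum_{j=0}^{4} y_{ij} (\log T_{ij} + x_{ij}\beta^*) - \mu_{i+} \Big).
$$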


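In this case, the conditioning statistic is $y_{i+}$, and its distribution is straightforward
to derive:

$$
Y_{i+} \mid \beta^*, \lambda_i \sim \text{Poisson}(\mu_{i+}).
$$

Letting $c_{2i}^{-1} = y_{i+}!$, and recognizing that $\mu_{i+} = \exp(\lambda_i) \sum_{j=0}^{4} T_{ij} \exp(x_{ij}\beta^*)$,
gives

$$
p(y_{i+} \mid \beta^*, \lambda_i) = c_{2i} \exp( -\mu_{i+} + y_{i+} \log \mu_{i+} )
= c_{2i} \exp\Big( \lambda_i y_{i+} + y_{i+} \log \Big[ \sum_{j=0}^{4} T_{ij} \exp(x_{ij}\beta^*) \Big] - \mu_{i+} \Big).
$$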


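Hence,

$$
p(y_i \mid y_{i+}, \beta^*) = \frac{p(y_i \mid \beta^*, \lambda_i)}{p(y_{i+} \mid \beta^*, \lambda_i)}
$$

simplifies to

$$
\begin{aligned}
p(y_i \mid y_{i+}, \beta^*)
&= \frac{c_{1i}}{c_{2i}} \exp\Big( \sum_{j=0}^{4} y_{ij} (\log T_{ij} + x_{ij}\beta^*) - y_{i+} \log \Big[ \sum_{l=0}^{4} T_{il} \exp(x_{il}\beta^*) \Big] \Big) \\
&= \frac{y_{i+}!}{\prod_{j=0}^{4} y_{ij}!} \prod_{j=0}^{4} \exp\Big\{ y_{ij} \Big[ \log T_{ij} + x_{ij}\beta^* - \log \sum_{l=0}^{4} T_{il} \exp(x_{il}\beta^*) \Big] \Big\} \\
&= \frac{y_{i+}!}{\prod_{j=0}^{4} y_{ij}!} \prod_{j=0}^{4} \Big( \frac{T_{ij} \exp(x_{ij}\beta^*)}{\sum_{l=0}^{4} T_{il} \exp(x_{il}\beta^*)} \Big)^{y_{ij}},
\end{aligned}
$$
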

which is a multinomial likelihood (we have conditioned a set of Poisson counts on
their total so this is no surprise). More transparently,

$$
y_i \mid y_{i+}, \beta^* \sim \text{Multinomial}_5(y_{i+}, \pi_i),
$$

where $\pi_i = [\pi_{i0}, \ldots, \pi_{i4}]^T$ and

$$
\pi_{ij} = \frac{T_{ij} \exp(x_{ij}\beta^*)}{\sum_{l=0}^{4} T_{il} \exp(x_{il}\beta^*)}.
$$
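
The cancellation of $\lambda_i$ can be checked numerically. The following is a minimal sketch, not part of the original analysis: the counts, observation times, and linear predictors are hypothetical values chosen for illustration, and the function names are our own. It computes the conditional probability directly as a ratio of Poisson probabilities and compares it with the multinomial form.

```python
from math import exp, factorial, prod

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return exp(-mu) * mu**y / factorial(y)

def conditional_pmf(y, T, eta, lam):
    """p(y | y_+) as joint Poisson probability divided by Poisson probability of the total.

    eta[j] plays the role of x_ij beta*, and lam the role of lambda_i.
    """
    mu = [exp(lam) * t * exp(e) for t, e in zip(T, eta)]
    joint = prod(poisson_pmf(yj, mj) for yj, mj in zip(y, mu))
    return joint / poisson_pmf(sum(y), sum(mu))

def multinomial_pmf(y, T, eta):
    """Multinomial probability with pi_j proportional to T_j exp(eta_j)."""
    w = [t * exp(e) for t, e in zip(T, eta)]
    pi = [wj / sum(w) for wj in w]
    coef = factorial(sum(y))
    for yj in y:
        coef //= factorial(yj)  # exact integer multinomial coefficient
    return coef * prod(p**yj for p, yj in zip(pi, y))

# Hypothetical counts for one individual over a baseline and four follow-up periods.
y = [11, 5, 3, 3, 3]
T = [8, 2, 2, 2, 2]
eta = [0.0, 0.1, 0.1, 0.1, 0.1]

target = multinomial_pmf(y, T, eta)
for lam in (0.2, 1.5):
    # The same value is obtained whatever the individual-specific intercept.
    print(lam, conditional_pmf(y, T, eta, lam), target)
```

Whatever value of `lam` is supplied, the ratio agrees with the multinomial probability up to rounding error, which is the sense in which conditioning on $y_{i+}$ removes $\lambda_i$.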

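Since $x_{i1} = x_{i2} = x_{i3} = x_{i4}$ and $T_{i0} = 8 = \sum_{j=1}^{4} T_{ij}$, we effectively have two
observation periods of equal length. Letting $Y_i^* = \sum_{j=1}^{4} Y_{ij}$,

$$
Y_i^* \mid y_{i+}, \beta^* \sim \text{Binomial}(y_{i+}, \pi_i^*),
$$

where the odds are such that

$$
\frac{\pi_i^*}{1 - \pi_i^*} =
\begin{cases}
\exp(\beta_2) & i = 1, \ldots, 28, \text{ placebo group} \\
\exp(\beta_2 + \beta_3) & i = 29, \ldots, 59, \text{ progabide group.}
\end{cases}
$$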

Hence, fitting can be simply performed using logistic regression. For the seizure
data, the sums of the denominators $y_{i+}$ are 1,825 and 1,967 for placebo and progabide,
with 963 and 987 total seizures in the post-treatment period. These values result
in estimates (standard errors) of $\hat\beta_2 = 0.11$ (0.047) and $\hat\beta_3 = -0.10$ (0.065). The
estimate of $\beta_3$ suggests a beneficial effect of progabide, but the difference from zero is not
significant. Performing Fisher's exact test (Sect. 7.7) makes little difference for these
data since the counts are large.
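
Because each group reduces to a single binomial count, the logistic regression estimates have closed forms. The sketch below reproduces them from the four totals quoted above, using the usual large-sample standard error of a log odds; the function name is our own.

```python
from math import log, sqrt

def log_odds(successes, total):
    """Log odds and its large-sample standard error for a single binomial count."""
    failures = total - successes
    estimate = log(successes / failures)
    std_err = sqrt(1 / successes + 1 / failures)
    return estimate, std_err

# Post-treatment seizures and total seizures in each group, as quoted in the text.
placebo = log_odds(963, 1825)
progabide = log_odds(987, 1967)

beta2, se_beta2 = placebo
beta3 = progabide[0] - placebo[0]                   # difference in log odds
se_beta3 = sqrt(placebo[1] ** 2 + progabide[1] ** 2)  # independent groups

print(round(beta2, 2), round(se_beta2, 3))  # 0.11 0.047
print(round(beta3, 2), round(se_beta3, 3))  # -0.1 0.065
```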

The  conditional  likelihood  approach  is  quite  intuitive  in  this  example  and  results 
in  a  two-period  design  in  which  each  person  is  acting  as  their  own  control. 
Conditioning  on  the  sum  of  the  two  counts  results  in  a  single  outcome  per  patient 
and  removes  the  need  to  confront  the  dependency  issue. 




9.6  Bayesian  Inference  for  Generalized  Linear 
Mixed  Models 

9.6.1  Model  Formulation 

A Bayesian approach to inference for a GLMM requires a prior distribution for
$\beta$, $\alpha$. As with the linear mixed model (Sect. 8.6), a proper prior is required for the
matrix $D$. A proper prior is not always necessary for $\beta$, but care is required. The
exponential  family  and  canonical  link  lead  to  a  likelihood  that  is  well  behaved  (in 
particular,  with  respect  to  tail  behavior),  though  it  is  safer  to  specify  a  proper  prior 
since  impropriety  of  the  posterior  can  occur  in  some  cases  (e.g.,  with  noncanonical 
links  or  when  counts  are  either  equal  to  zero  or  to  the  denominator;  see  Sect.  6.8.1). 
As  with  the  LMM,  closed-form  inference  is  unavailable,  but  MCMC  (Sect.  3.8) 
is  almost  as  straightforward  as  in  the  LMM,  and  the  integrated  nested  Laplace 
approximation  approach  (Sect.  3.7.4)  is  also  available  though  the  approximation  is 
not  always  accurate  for  the  GLMM  (Fong  et  al.  2010). 

Let $W = D^{-1}$, and assume that there are no unknown scale parameters at stage
one of the model (i.e., $\alpha = 1$), as is the case for binomial and Poisson models. The
joint posterior is

$$
p(\beta, W, b \mid y) \propto \prod_{i=1}^{m} \big[ p(y_i \mid \beta, b_i)\, p(b_i \mid W) \big]\, \pi(\beta, W).
$$

We  assume  independent  hyperpriors: 

$$
\beta \sim N_{q+1}(\beta_0, V_0)
$$

$$
W \sim \text{Wish}_{q+1}(r, R^{-1})
$$

where $\text{Wish}_{q+1}(r, R^{-1})$ denotes a Wishart distribution of dimension $q + 1$ with
degrees of freedom $r$ and scale matrix $R^{-1}$; see Sect. 8.6.2 for further discussion.
The conditional distribution for $W$ is unchanged from the LMM case. There are no
closed-form conditional distributions for $\beta$, or for $b_i$, but if an MCMC approach is
followed, Metropolis-Hastings steps can be used.
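
To make the Metropolis-Hastings step concrete, here is a minimal sketch of a random walk update for a single random intercept in a Poisson GLMM, holding $\beta$ and $\sigma_0$ fixed. It is an illustration under stated assumptions rather than the scheme used for the analyses in this chapter: the data, step size, and function names are hypothetical, and `eta` stands for the fixed part $\log T_{ij} + x_{ij}\beta$ of the linear predictor.

```python
import math
import random

def log_target(b, y, eta, sigma0):
    """Log of p(y_i | beta, b) p(b | sigma0), up to an additive constant."""
    loglik = sum(yj * (ej + b) - math.exp(ej + b) for yj, ej in zip(y, eta))
    return loglik - 0.5 * (b / sigma0) ** 2

def metropolis_b(y, eta, sigma0, n_iter=5000, step=0.5, seed=1):
    """Random walk Metropolis for one random intercept; returns draws and acceptance rate."""
    rng = random.Random(seed)
    b = 0.0
    current = log_target(b, y, eta, sigma0)
    draws, accepted = [], 0
    for _ in range(n_iter):
        proposal = b + rng.gauss(0.0, step)          # symmetric proposal
        candidate = log_target(proposal, y, eta, sigma0)
        u = 1.0 - rng.random()                       # uniform on (0, 1], avoids log(0)
        if math.log(u) < candidate - current:
            b, current = proposal, candidate
            accepted += 1
        draws.append(b)
    return draws, accepted / n_iter

# Hypothetical counts for one individual, with the fixed part of the predictor set to zero.
draws, acc_rate = metropolis_b(y=[5, 3, 4, 6], eta=[0.0] * 4, sigma0=1.0)
kept = draws[1000:]                                  # discard burn-in
print(sum(kept) / len(kept), acc_rate)
```

In a full sampler this update would be cycled over all individuals and alternated with updates for $\beta$ and $W$; the step size would usually be tuned to give a moderate acceptance rate.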


9.6.2  Hyperpriors 

In a GLMM we can often specify priors for more meaningful parameters than the
original elements of $\beta$. For example, $\exp(\beta)$ is the relative risk/rate in a loglinear
model and is the odds ratio in a logistic model. It is convenient to specify lognormal
priors for a generic parameter $\theta > 0$, since one may specify two quantiles of
the distribution, and directly solve for the two parameters of the prior. Denote
by $\text{LogNorm}(\mu, \sigma)$ the lognormal prior distribution for $\theta$ with $E[\log\theta] = \mu$ and
$\text{var}(\log\theta) = \sigma^2$, and let $\theta_1$ and $\theta_2$ be the $q_1$ and $q_2$ quantiles of this prior. Then,
(3.15) and (3.16) give the lognormal parameters. As an example, in a Poisson model,
suppose we believe there is a 50% chance that the relative risk is less than 1 and a
95% chance that it is less than 5. With $q_1 = 0.5$, $\theta_1 = 1.0$ and $q_2 = 0.95$, $\theta_2 = 5.0$,
we obtain lognormal parameters $\mu = 0$ and $\sigma = \log(5)/1.645 = 0.98$.
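
The two-quantile calculation is easily scripted. The sketch below does not quote (3.15) and (3.16), which are not reproduced in this section; instead it solves the pair of equations $\log\theta_k = \mu + \sigma z_{q_k}$, $k = 1, 2$, where $z_q$ is the standard normal quantile, which we assume is equivalent. The function name is our own.

```python
from math import log
from statistics import NormalDist

def lognormal_from_quantiles(theta1, q1, theta2, q2):
    """Solve log(theta_k) = mu + sigma * z_{q_k}, k = 1, 2, for (mu, sigma)."""
    z1 = NormalDist().inv_cdf(q1)
    z2 = NormalDist().inv_cdf(q2)
    sigma = (log(theta2) - log(theta1)) / (z2 - z1)
    mu = log(theta1) - sigma * z1
    return mu, sigma

# Median relative risk of 1 and 95% point of 5, as in the example in the text.
mu, sigma = lognormal_from_quantiles(1.0, 0.5, 5.0, 0.95)
print(round(mu, 2), round(sigma, 2))  # 0.0 0.98
```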

Consider the random intercepts model with $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$. It is not
straightforward to specify a prior for $\sigma_0$, which represents the standard deviation of
the residuals on the linear predictor scale and is consequently not easy to interpret.
We specify a gamma prior $\text{Ga}(a, b)$ for the precision $\tau_0 = 1/\sigma_0^2$, with parameters
$a$, $b$ specified a priori. The choice of a gamma distribution is convenient since it
produces a marginal distribution for the "residuals" in closed form. As discussed in
Sect. 8.6.2, the marginal distribution for $b_i$ is $t_d(0, \lambda^2)$, a Student's t distribution
with $d = 2a$ degrees of freedom, location zero, and scale $\lambda^2 = b/a$. These
summaries allow prior specification based on beliefs concerning the residuals on
a natural scale.

As an example, consider a log link, in which case the above prior specification
is equivalent to the residual relative risks following a log Student's t distribution.
We specify the range $\exp(\pm V)$ within which we expect the residual relative risks
to lie with probability $q$ and use the relationship $\pm t^d_{1-(1-q)/2}\,\lambda = \pm V$, where $t^d_p$ is the $p$th
quantile of a Student's t random variable with $d$ degrees of freedom, to give $a = d/2$,
$b = V^2 d / [2 (t^d_{1-(1-q)/2})^2]$. For example, if we assume a priori that the residual relative
risks follow a log Student's t distribution with 2 degrees of freedom and that 95%
of these risks fall in the interval [0.5, 2.0], then we obtain the prior $\text{Ga}(1, 0.0260)$.
In terms of $\sigma_0$, this results in [2.5%, 97.5%] quantiles of [0.084, 1.01] with prior
median 0.19.
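
The numbers in this example can be reproduced with a short script. The sketch below is restricted to $d = 2$, an assumption made so that only closed forms are needed: the Student's t quantile with 2 degrees of freedom is $(2p - 1)/\sqrt{2p(1 - p)}$, and with $a = 1$ the gamma prior on the precision is exponential. The function names are our own.

```python
from math import log, sqrt

def t2_quantile(p):
    """p-quantile of a Student's t distribution with 2 degrees of freedom (closed form)."""
    return (2 * p - 1) / sqrt(2 * p * (1 - p))

def gamma_prior_d2(upper, q):
    """Ga(a, b) prior on the precision so that residual relative risks lie in
    [1/upper, upper] with probability q, under a log Student's t with d = 2."""
    d = 2
    V = log(upper)
    t = t2_quantile(1 - (1 - q) / 2)
    a = d / 2
    b = V ** 2 * d / (2 * t ** 2)
    return a, b

def sigma_quantile(p, b):
    """p-quantile of sigma_0 = tau_0^(-1/2) when tau_0 ~ Ga(1, b), i.e. exponential with rate b."""
    tau = -log(p) / b          # the (1 - p)-quantile of the precision
    return 1 / sqrt(tau)

a, b = gamma_prior_d2(upper=2.0, q=0.95)
print(a, round(b, 4))                                   # 1.0 0.026
print(round(sigma_quantile(0.025, b), 3),               # 0.084
      round(sigma_quantile(0.5, b), 2),                 # 0.19
      round(sigma_quantile(0.975, b), 2))               # 1.01
```

For other degrees of freedom the t quantile has no such simple form and a numerical routine would be needed.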

It is important to assess whether the prior allows all reasonable levels of
variability in the residual relative risks; in particular, small values should not be
excluded. The prior $\text{Ga}(0.001, 0.001)$, which has been widely used under the guise
of being relatively non-informative, should be avoided for this reason. This prior
corresponds to the relative risks following a log Student's t distribution with 0.002
degrees of freedom, so that the spread is enormous. For example, the 0.01 quantile
for $\sigma_0$ is 6.4 so that it is unlikely a priori that the standard deviation is small.


Example:  Seizure  Data 

For  illustration,  we  consider  three  models  for  the  seizure  data: 

Model 1: The conditional mean model we start with has stages one and two
given by:

$$
\begin{aligned}
Y_{ij} \mid \beta, b_i &\sim_{ind} \text{Poisson}[T_{ij} \exp(x_{ij}\beta + b_i)] \\
b_i \mid \sigma_0^2 &\sim_{iid} N(0, \sigma_0^2).
\end{aligned}
\qquad (9.6)
$$

For a Bayesian analysis, we require priors for $\beta$ and $\sigma_0^2$. In this and the following
two models, we take the improper prior $\pi(\beta) \propto 1$. We assume $\sigma_0^{-2} = \tau_0 \sim \text{Ga}(1, 0.260)$.
This prior corresponds to a Student's $t_2$ distribution for the residual
rates with a 95% prior interval of [0.5, 2.0].

Model 2: We assume the same first and second stages as model 1 but address the
sensitivity to the prior on $\tau_0$. Specifically, we perturb the prior to $\tau_0 \sim \text{Ga}(2, 1.376)$,
which corresponds to a Student's $t_4$ distribution for the residual rates with a 95%
interval [0.1, 10.0].

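Model 3: As pointed out in Sect. 9.4, a Poisson mixed model with a single random
effect has a single parameter $\sigma_0$ only to model excess-Poisson variability and
within-individual dependence. Therefore, we introduce "measurement error" into
the model via the introduction of an additional random effect in the linear predictor.
To motivate this model, consider the random intercepts only LMM model:

$$
\begin{aligned}
E[Y_{ij} \mid b_i, \epsilon_{ij}] &= x_{ij}\beta + b_i + \epsilon_{ij} \\
b_i \mid \sigma_0^2 &\sim N(0, \sigma_0^2) \\
\epsilon_{ij} \mid \sigma_\epsilon^2 &\sim N(0, \sigma_\epsilon^2),
\end{aligned}
$$

with $b_i$ and $\epsilon_{ij}$ independent. By analogy, consider the model:

$$
\begin{aligned}
Y_{ij} \mid b_i, \epsilon_{ij} &\sim_{ind} \text{Poisson}[T_{ij} \exp(x_{ij}\beta + b_i + \epsilon_{ij})] \\
b_i \mid \sigma_0^2 &\sim_{iid} N(0, \sigma_0^2) \\
\epsilon_{ij} \mid \sigma_\epsilon^2 &\sim_{iid} N(0, \sigma_\epsilon^2)
\end{aligned}
$$

with $b_i$ and $\epsilon_{ij}$ independent. There are now two parameters to allow for between-individual
variability, $\sigma_0^2$, and within-individual variability, $\sigma_\epsilon^2$ (with both producing
excess-Poisson variability). Unfortunately, there is no simple marginal interpretation
of $\sigma_0^2$ and $\sigma_\epsilon^2$ since

$$
\begin{aligned}
E[Y_{ij}] &= \mu_{ij} = T_{ij} \exp(x_{ij}\beta + \sigma_0^2/2 + \sigma_\epsilon^2/2) \\
\text{var}(Y_{ij}) &= \mu_{ij} \{ 1 + \mu_{ij} [\exp(\sigma_0^2 + \sigma_\epsilon^2) - 1] \} \\
\text{cov}(Y_{ij}, Y_{ik}) &= T_{ij} T_{ik} \exp[(x_{ij} + x_{ik})\beta] \exp(\sigma_0^2 + \sigma_\epsilon^2) [\exp(\sigma_0^2) - 1]
= \mu_{ij}\mu_{ik} [\exp(\sigma_0^2) - 1], \quad j \neq k.
\end{aligned}
$$

These follow from the lognormal moments of $\exp(b_i + \epsilon_{ij})$ together with the laws of total
variance and covariance. The expression for the marginal covariance shows that $\sigma_0^2$ is controlling the
within-individual dependence in the model, with large values giving high dependence. The
expression for the marginal variance is quadratic in the mean and is controlled by
both $\sigma_0^2$ and $\sigma_\epsilon^2$, with large values corresponding to greater excess-Poisson variability.
We assign independent priors $\sigma_0^{-2} \sim \text{Ga}(1, 0.260)$, $\sigma_\epsilon^{-2} \sim \text{Ga}(1, 0.260)$.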

All three models were implemented using MCMC. Table 9.5 gives summaries for
the three models. Model 1 gives very similar inference to the likelihood approach
described in Sect. 9.4 (specifically, the result presented in Table 9.4), which is not
surprising given the relatively large sample size and weak priors. Model 2 shows
little sensitivity to the prior distribution on $\sigma_0$, which is again not surprising given
the number of individuals. Model 3 shows substantive differences, however. The
parameter of interest $\beta_3$ is now greatly reduced, with a 95% credible interval for
the rate ratio being [0.56, 0.99]. The reason for the change is that in the progabide group,
there is a single individual (as seen in Fig. 9.2) who is very influential; this individual
has counts of 151, 102, 65, 72 and 63 in the five time periods. The introduction of
measurement error accommodates this individual. The posterior medians of $\epsilon_{ij}$ for
this individual show a negative error term at baseline, followed by a run of positive
terms post-baseline: -0.61, 0.61, 0.17, 0.27, 0.14. The difference in signs explains
why the between-individual random effect cannot accommodate this individual's
data. Notice also that $\beta_2$ (the log ratio of seizure rates in the post-baseline period
relative to the baseline period, for typical individuals in the placebo group) is now
close to zero, whereas in models 1 and 2, it is 0.11. This shows that the aberrant
individual's measurements were responsible for the high value of $\beta_2$ in the first two
models. The estimate for $\sigma_\epsilon$ is less than half the estimate for $\sigma_0$ so that between-individual
variability is greater than within-individual variability for these data.

Table 9.5 Posterior means and standard deviations for Bayesian analyses of the seizure data

                     Model 1               Model 2               Model 3
Parameter            Estimate  Std. err.   Estimate  Std. err.   Estimate  Std. err.
$\beta_0$             1.03     0.16         1.04     0.16         1.04     0.18
$\beta_1$            -0.036    0.21        -0.030    0.22         0.062    0.25
$\beta_2$             0.11     0.047        0.11     0.047        0.0064   0.10
$\beta_3$            -0.10     0.065       -0.10     0.065       -0.29     0.14
$\sigma_0$            0.80     0.078        0.81     0.077        0.82     0.084
$\sigma_\epsilon$     -        -            -        -            0.39     0.033

See the caption of Table 9.4 for details on parameter interpretation. Models 1 and 2 are standard
GLMMs and differ only in the priors placed on $\sigma_0$, which is the standard deviation of the
random intercepts. Model 3 adds an additional measurement error random effect, with standard
deviation $\sigma_\epsilon$.

In  analyses  presented  in  Diggle  et  al.  (2002),  the  influential  individual  was 
dropped,  and  in  their  Table  9.7,  the  single  random  effect  analysis  produced  an 
estimate (standard error) of -0.30 (0.070), which is very similar to that for model
3.  We  would  always  prefer  to  not  remove  individuals  from  the  analysis,  however, 
unless  there  are  substantive  reasons  to  do  so. 

Another  possibility  for  modeling  excess-Poisson  variability,  by  combining  the 
Poisson  likelihood  with  a  gamma  random  effects  distribution,  is  considered  in 
Sect.  9.8.  □ 

In  the  last  example  we  saw  that  the  introduction  of  normal  random  effects 
accounted  for  both  measurement  error  and  between-individual  variability.  This 
flexibility  is  a  great  benefit  of  the  GLMM  framework.  One  way  of  approaching 
modeling  is  to  first  imagine  that  the  response  is  continuous  and  then  decide  upon 
a  model  that  would  be  considered  in  this  case.  The  same  structure  can  then  be 
assumed  for  the  data  at  hand  but  on  the  linear  predictor  scale.  In  the  next  example, 
the  versatility  is  further  illustrated  with  a  model  for  spatial  dependence. 

9.7 Generalized Linear Mixed Models with Spatial
Dependence

9.7.1  A  Markov  Random  Field  Prior 

The  topic  of  modeling  residual  spatial  dependence  is  vast  and  here  we  only  scratch 
the  surface  and  present  a  model  that  is  popular  in  the  spatial  epidemiology  literature, 
and  fits  within  the  GLMM  framework.  We  first  describe  the  model  and  then  illustrate 
its  use  on  the  lung  cancer  and  radon  data  of  Sect.  1.3.3. 

The  following  three-stage  model  was  introduced  by  Besag  et  al.  (1991)  in  the 
context  of  disease  mapping: 


Stage One: The distribution of the response in area $i$ is

$$
Y_i \mid \beta, \epsilon_i, S_i \sim_{ind} \text{Poisson}[E_i \mu_i \exp(\epsilon_i + S_i)]
$$

with loglinear mean model

$$
\log \mu_i = \beta_0 + \beta_1 x_i, \qquad (9.7)
$$

where $x_i$ is the radon level in area $i$. The random effects $\epsilon_i$ and $S_i$ represent error
terms without and with spatial structure, respectively. We have already encountered
the nonspatial version when a Poisson-Gamma model was described for these data
in Chap. 6. There are many models one might envision for the spatial terms $S_i$,
$i = 1, \ldots, m$. An obvious isotropic form would be $S = [S_1, \ldots, S_m]^T \sim N_m(0, \sigma_s^2 R)$
with $R$ a correlation matrix with $R_{ii'}$ describing the correlation between areas $i$
and $i'$, $i, i' = 1, \ldots, m$. A common form is $R_{ii'} = \rho^{d_{ii'}}$ where $d_{ii'}$ is the distance
between the centroids of areas $i$ and $i'$. We have already seen this form of correlation
in the context of longitudinal data; see in particular (8.14).

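Marginally, this model gives

$$
\begin{aligned}
E[Y_i] &= E_i \mu_i \exp(\sigma_\epsilon^2/2 + \sigma_s^2/2) \\
\text{var}(Y_i) &= E[Y_i] \{ 1 + E[Y_i] [\exp(\sigma_\epsilon^2 + \sigma_s^2) - 1] \} \\
\text{cov}(Y_i, Y_{i'}) &= E_i \mu_i E_{i'} \mu_{i'} \exp(\sigma_\epsilon^2 + \sigma_s^2) [\exp(\sigma_s^2 \rho^{d_{ii'}}) - 1], \quad i \neq i'.
\end{aligned}
$$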

This  isotropic  model  is  computationally  expensive  within  an  MCMC  scheme 
because  we  need  to  invert  R  at  each  iteration  to  obtain  the  conditional  distribution. 
We  describe  an  alternative  which  is  both  computationally  feasible  and  statistically 
appealing. 

Stage Two: The random effects distributions are

$$
\epsilon_i \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2) \qquad (9.8)
$$

$$
S_i \mid S_{i'}, i' \in \text{ne}(i), \sigma_s^2 \sim_{ind} N\Big( \bar{S}_i, \frac{\sigma_s^2}{n_i} \Big) \qquad (9.9)
$$

where $\bar{S}_i = \frac{1}{n_i} \sum_{i' \in \text{ne}(i)} S_{i'}$ is the mean of the "neighbors" of area $i$, with
$\text{ne}(i)$ defining the set of, and $n_i$ the number of, such neighbors. This intrinsic
conditional autoregressive (ICAR) model is very appealing since it provides local
spatial smoothing and may be viewed as providing stochastic interpolation (Besag
and Kooperberg 1995). A common definition (which we adopt in the example at the
end of this section) is that two areas are neighbors if they share a common boundary.
In non-lattice systems, this is clearly ad hoc.

An interesting aspect of this model is that the joint distribution is undefined. The
form of the joint "density" is

$$
p(s \mid \sigma_s^2) \propto \sigma_s^{-(m-r)} \exp\Big[ -\frac{1}{2\sigma_s^2} \sum_{i < i'} W_{ii'} (s_i - s_{i'})^2 \Big] \qquad (9.10)
$$

where $W_{ii'} = 1$ if areas $i$ and $i'$ are neighbors and $W_{ii'} = 0$ otherwise. In the spatial
context, $r$ is the number of connected regions. So if $r = 1$, there is no collection
of areas that are not neighbors of the remaining areas, which means that we cannot
break the study region into collections of areas that are unconnected. One way of
thinking about this model is that it specifies a prior on the differences between levels
in different areas but not on the overall level.

There are two equivalent representations of model (9.10) that are commonly
used. In one approach, the intercept $\beta_0$ is removed from the mean model (9.7),
while in the other, we allow an intercept $\beta_0$, along with an improper uniform prior
for this parameter, and then constrain $\sum_{i=1}^{m} S_i = 0$. In the following we assume that the
intercept has been excluded from the model. See Besag and Kooperberg (1995) and
Rue and Held (2005) for further discussion of this model.

Stage  Three:  Hyperpriors: 


$$
\begin{aligned}
\beta &\sim N(\mu_\beta, \Sigma_\beta) \\
\sigma_\epsilon^{-2} &\sim \text{Gamma}(a_\epsilon, b_\epsilon) \\
\sigma_s^{-2} &\sim \text{Gamma}(a_s, b_s).
\end{aligned}
$$

9.7.2 Hyperpriors

Picking a prior for $\sigma_s$ is not straightforward because of its interpretation as the
conditional standard deviation. In particular, $\sigma_s$ and $\sigma_\epsilon$ are not directly comparable
since the latter has a marginal interpretation (on the log relative risk scale).

We describe how to simulate realizations from (9.10) to examine candidate prior
distributions. As already noted, due to the rank deficiency, (9.10) does not define a
probability density, and so we cannot directly simulate from this prior. We need to
define some new notation in order to describe the method of simulation. The model
can be written in the form

$$
p(s \mid \sigma_s^2) = (2\pi)^{-(m-r)/2} |Q^*|^{1/2} \sigma_s^{-(m-r)} \exp\Big( -\frac{1}{2\sigma_s^2} s^T Q s \Big) \qquad (9.11)
$$

where $s = [s_1, \ldots, s_m]^T$ is the collection of random effects, $Q$ is a (scaled)
"precision" matrix of rank $m - r$, with

$$
Q_{ii'} =
\begin{cases}
n_i & \text{if } i = i' \\
-W_{ii'} & \text{if } i \neq i',
\end{cases}
$$

and $|Q^*|$ is a generalized determinant which is the product over the $m - r$ nonzero
eigenvalues of $Q$.

Rue and Held (2005) give the following algorithm for generating samples from
(9.11):

1. Simulate $z_j \sim N(0, \lambda_j^{-1})$, for $j = r + 1, \ldots, m$, where $\lambda_j$ are the nonzero eigenvalues
of $Q$, ordered so that $\lambda_1 = \cdots = \lambda_r = 0$ (recall there are $m - r$ nonzero eigenvalues
as $Q$ has rank $m - r$).

2. Return $s = z_{r+1} e_{r+1} + z_{r+2} e_{r+2} + \cdots + z_m e_m = Ez$, where $e_j$ are the
corresponding eigenvectors of $Q$, $E$ is the $m \times (m - r)$ matrix with these
eigenvectors as columns, and $z$ is the $(m - r) \times 1$ vector containing $z_j$,
$j = r + 1, \ldots, m$.

The simulation algorithm is conditioned so that samples are zero in the null-space
of $Q$. If $s$ is a sample and the null-space is spanned by $v_1$ and $v_2$, then
$s^T v_1 = s^T v_2 = 0$. For example, suppose $Q1 = 0$, with $1$ the vector of ones, so that the
null-space is spanned by $1$ and the rank deficiency is 1. Then $Q$ is of rank $m - 1$, since the eigenvalue
corresponding to $1$ is zero, and samples $s$ produced by the algorithm are such that
$s^T 1 = 0$. It is also useful to note that if we wish to compute only the marginal variances,
then simulation is not required, as they are available as the diagonal elements
of the matrix $\sum_j \lambda_j^{-1} e_j e_j^T$, with the sum taken over the nonzero eigenvalues.
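
The algorithm can be illustrated on a toy neighborhood structure. The sketch below is our own and makes a simplifying assumption: the $m$ areas lie along a line, each area neighboring only the adjacent areas, so that $r = 1$ and the eigenvalues and eigenvectors of $Q$ are available in closed form ($\lambda_k = 2 - 2\cos(k\pi/m)$ with cosine eigenvectors). For a general map, such as the 87 areas considered later, a numerical eigendecomposition would be used instead.

```python
import math
import random

def chain_icar_eigen(m):
    """Nonzero eigenvalues and unit eigenvectors of Q when m areas form a chain."""
    pairs = []
    norm = math.sqrt(m / 2.0)
    for k in range(1, m):
        theta = math.pi * k / m
        lam = 2.0 - 2.0 * math.cos(theta)
        vec = [math.cos(theta * (i + 0.5)) / norm for i in range(m)]
        pairs.append((lam, vec))
    return pairs

def sample_icar(pairs, m, rng, sigma_s=1.0):
    """One draw s = sum_j z_j e_j with z_j ~ N(0, sigma_s^2 / lambda_j)."""
    s = [0.0] * m
    for lam, vec in pairs:
        z = rng.gauss(0.0, sigma_s / math.sqrt(lam))
        for i in range(m):
            s[i] += z * vec[i]
    return s

def marginal_variances(pairs, m):
    """Diagonal of sum_j e_j e_j^T / lambda_j, the marginal variances when sigma_s = 1."""
    return [sum(vec[i] ** 2 / lam for lam, vec in pairs) for i in range(m)]

m = 10
pairs = chain_icar_eigen(m)
s = sample_icar(pairs, m, random.Random(0))
print(sum(s))                                    # zero up to rounding error
variances = marginal_variances(pairs, m)
print(sum(variances) / m)                        # average marginal variance
```

Each draw sums to zero, as implied by the conditioning on the null-space, and the average of `variances` is the kind of summary used in the next example to calibrate the prior on the spatial precision.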

Fig. 9.5 (a) Nonspatial and (b) spatial random effects for the Minnesota lung cancer data


Example:  Lung  Cancer  and  Radon 

We apply the Poisson model with nonspatial and spatial normal random effects, that
is, the model given by (9.8) and (9.9). We note that model (9.7) does not aggregate
correctly from a plausible individual-level model; see Wakefield (2007b) and the
discussion leading to model (6.19). The prior on $\beta_1$ is $N(0, 1.17^2)$, which gives a
95% interval for the relative risk of [0.1, 10].

The priors on $\sigma_\epsilon^2$ and $\sigma_s^2$ require more care, but we would like to specify priors
in such a way that the nonspatial and spatial contributions are approximately equal.
This is complicated by $\sigma_s^2$ having a conditional interpretation, as just discussed. We
specify gamma priors for each of the precisions, $\sigma_\epsilon^{-2}$ and $\sigma_s^{-2}$. To make the priors
compatible, we first specify a prior for $\sigma_\epsilon^{-2}$ and evaluate the average of the marginal
variances over the 87 areas, when $\sigma_s^2 = 1$, as described at the end of Sect. 9.7.2. We
then match up the means of the gamma distributions. Following the development of
Sect. 9.6.2 for the unstructured variability, we assume that the unstructured residual
relative risks lie in the interval [0.2, 5] with probability 0.95 and assume $d = 2$
to give the exponential prior distribution $\text{Ga}(1, 0.140)$ for $\sigma_\epsilon^{-2}$. The average of the
marginal variances over the study region for the spatial random effects is 0.21;
hence, the average of the marginal precisions is approximately 1/0.21. The prior
for $\sigma_s^{-2}$ is therefore $\text{Ga}(0.21, 0.140)$, to give $E[\sigma_s^{-2}] = 0.21 \times E[\sigma_\epsilon^{-2}]$.

The fitting of this model (using INLA) results in the posterior mean estimates $\hat\epsilon_i$
and $\hat{S}_i$ mapped in Fig. 9.5(a) and (b) respectively. Notice that the scale is narrower
in panel (b), since the spatial contribution to the residuals is relatively small here,
though the spatial pattern in these residuals is apparent. As we discussed with
respect to prior specification, the variances $\sigma_\epsilon^2$ and $\sigma_s^2$ are not directly comparable,
and so we calculate an approximate proportion of the total residual variance that is
spatial by comparing $\hat\sigma_\epsilon^2$ with an empirical estimate of the marginal variance of the
collection of random effects $\{\hat{S}_i, i = 1, \ldots, m\}$. Specifically, we calculate

$$
\frac{\text{var}(\hat{S}_i)}{\text{var}(\hat{S}_i) + \hat\sigma_\epsilon^2}
$$

where $\text{var}(\hat{S}_i)$ is the empirical variance of the random effects and $\hat\sigma_\epsilon^2$ is the posterior
median. From this calculation, the fraction of the total residual variability that is
attributed to the spatial component is 0.13.

Table 9.6 Parameter estimates for $\beta_1$, the area-level log relative risk corresponding
to radon, and measures of uncertainty (standard errors and posterior standard
deviations) under various models, for the Minnesota lung cancer data

Model                                  Estimate (x 10^2)   Uncertainty (x 10^3)
Poisson                                -3.6                5.4
Quasi-likelihood                       -3.6                8.8
Negative binomial                      -2.9                8.2
Nonspatial random effects              -2.8                9.1
Nonspatial and ICAR random effects     -2.8                9.7

Table  9.6  provides  estimates  and  standard  error/posterior  standard  deviations  for 
the  log  relative  risk  associated  with  a  unit  increase  in  radon,  for  a  variety  of  models. 
We  include  a  model  with  nonspatial  normal  random  effects  only.  The  Poisson  and 
quasi-likelihood  methods  assume  the  same  form  of  (proportional)  mean-variance 
relationship,  while  the  negative  binomial  and  nonspatial  normal  random  effects 
approaches  imply  a  variance  that  is  quadratic  in  the  mean.  The  marginal  variance 
does  not  exist  under  the  improper  spatial  model,  but  here  the  spatial  contributions  are 
small.  We  might  therefore  expect  to  see  similar  conclusions  to  the  negative  binomial 
and  nonspatial  normal  random  effects  models.  This  is  borne  out  in  the  table,  with 
the  last  three  models  giving  similar  estimates  that  are  closer  to  zero  than  the  first 
two  models.  The  standard  error  from  the  spatial  model  does  increase  a  little  over  the 
nonspatial  random  effects  model. 

In  general,  if  strong  spatial  effects  are  present  and  the  exposure  surface  has  spatial 
structure,  then  when  spatial  random  effects  are  added  to  a  model,  large  changes  may 
be  seen  in  the  regression  coefficient  associated  with  exposure.  This  phenomenon, 
which  is  sometimes  known  as  confounding  by  location,  is  a  big  practical  headache 
since  it  is  difficult  to  decide  on  whether  to  attribute  spatial  variability  in  risk  to 
the  exposure  or  to  the  spatial  random  effects  (which  may  be  acting  as  surrogates  for 
unmeasured  confounders).  Wakefield  (2007b)  and  Hodges  and  Reich  (2010)  provide 
further  discussion. 

9.8 Conjugate Random Effects Models

An  obvious  approach  to  extending  models  for  independent  data  is  to  assume  a 
random  effects  distribution  that  is  conjugate  to  the  likelihood.  We  illustrate  this 
approach,  and  its  shortcomings,  through  two  examples. 


Example:  Lung  Cancer  and  Radon 

A Poisson-Gamma conjugate model was fitted to the lung cancer/radon data in
Sect. 6.9 with:

Stage One: $Y_i \mid \mu_i, \delta_i \sim_{ind} \text{Poisson}(\mu_i \delta_i)$, with $\log \mu_i = \beta_0 + \beta_1 x_i$, for
$i = 1, \ldots, m$.

Stage Two: $\delta_i \mid b \sim_{iid} \text{Gamma}(b, b)$ for $i = 1, \ldots, m$.

The advantage of this model is that the random effects can be analytically
integrated from the model to give $Y_i \mid \mu_i, b \sim_{ind} \text{NegBin}(\mu_i, b)$, $i = 1, \ldots, m$.
However, the extension to allow spatial dependence is not obvious, unless one
introduces normal random effects, as in the last section.


Example:  Seizure  Data 

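Letting $\mu_{ij} = T_{ij} \exp(x_{ij}\beta)$, consider the two-stage model:

$$
\begin{aligned}
Y_{ij} \mid \mu_{ij}, \delta_{ij} &\sim_{ind} \text{Poisson}(\mu_{ij}\delta_{ij}) \\
\delta_{ij} \mid b &\sim_{iid} \text{Ga}(b, b).
\end{aligned}
$$

This results in $Y_{ij} \mid \mu_{ij}, b \sim_{ind} \text{NegBin}(\mu_{ij}, b)$ with $E[Y_{ij}] = \mu_{ij}$ and
$\text{var}(Y_{ij}) = \mu_{ij}(1 + \mu_{ij}/b)$. This model allows for excess-Poisson variability but not for
dependence of observations on the same patient. The introduction of patient-specific
random effects allows for the latter but loses the analytical tractability. Specifically,
the two-stage model

$$
\begin{aligned}
Y_{ij} \mid \mu_{ij}, \delta_i &\sim_{ind} \text{Poisson}(\mu_{ij}\delta_i) \\
\delta_i \mid b &\sim_{iid} \text{Ga}(b, b)
\end{aligned}
$$

leads to a marginal model for the data of the $i$th individual of

$$
\Pr(y_{i0}, \ldots, y_{i4} \mid \mu_{ij}, b) = \prod_{j=0}^{4} \frac{\mu_{ij}^{y_{ij}}}{y_{ij}!} \times
\frac{b^b \, \Gamma(b + y_{i+})}{\Gamma(b) \, (b + \mu_{i+})^{b + y_{i+}}},
$$
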


which  is  not  of  negative  binomial  form. 
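
A quick numerical check of this joint distribution is given below. It is a sketch under our own assumptions: the marginal is taken to be the product of the Poisson kernels $\mu_{ij}^{y_{ij}}/y_{ij}!$ and the gamma integral $b^b\,\Gamma(b + y_{i+}) / [\Gamma(b)(b + \mu_{i+})^{b + y_{i+}}]$, obtained by integrating the shared random effect out of the model, and the means and the value of $b$ are hypothetical. Two observations are used so that the probabilities can be summed over a grid.

```python
from math import exp, lgamma, log

def log_marginal(y, mu, b):
    """Log joint probability of the counts y after integrating out a shared Ga(b, b) effect."""
    y_plus, mu_plus = sum(y), sum(mu)
    poisson_part = sum(yj * log(mj) - lgamma(yj + 1) for yj, mj in zip(y, mu))
    gamma_part = (b * log(b) + lgamma(b + y_plus)
                  - lgamma(b) - (b + y_plus) * log(b + mu_plus))
    return poisson_part + gamma_part

def negbin_pmf(y, mu, b):
    """Marginal NegBin(mu, b) probability of a single count."""
    return exp(lgamma(b + y) - lgamma(b) - lgamma(y + 1)
               + b * log(b / (b + mu)) + y * log(mu / (b + mu)))

mu, b = [0.5, 1.0], 2.0          # hypothetical means and gamma parameter

# The joint probabilities sum to one over a grid that covers essentially all the mass.
total = sum(exp(log_marginal([y1, y2], mu, b))
            for y1 in range(61) for y2 in range(61))
print(total)

# The joint probability of (0, 0) differs from the product of negative binomial
# marginals: the shared random effect induces dependence within an individual.
joint_00 = exp(log_marginal([0, 0], mu, b))
product_00 = negbin_pmf(0, mu[0], b) * negbin_pmf(0, mu[1], b)
print(joint_00, product_00)
```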

9.9 Generalized Estimating Equations for Generalized
Linear Models

The GEE approach was described in Sect. 8.7 for linear models. The extension to
GLM mean models is conceptually straightforward, since all that is required is
specification of a mean model and a working covariance model. The mean is

$$
g(\mu_{ij}) = x_{ij}\gamma,
$$

where $\mu_{ij} = E[Y_{ij}]$, $g(\cdot)$ is a link function, $x_{ij}$ is a row of the $n \times (k + 1)$ design matrix $x$, and
$\gamma$ is a $(k + 1) \times 1$ vector. We use $\gamma$ to denote the parameters in the marginal mean
model to distinguish them from the parameters $\beta$ which have been used to represent
the mixed model conditional parameters. The working covariance matrix is

$$
\text{var}(Y) = W,
$$

and in a GLM setting, $W$ will usually depend on $\gamma$ and on additional parameters $\alpha$
so that $W = W(\gamma, \alpha)$. Suppose $\hat\alpha$ is a consistent estimator of $\alpha$. Then, GEE takes
the estimator $\hat\gamma$ that satisfies

$$
G(\gamma, \hat\alpha) = \sum_{i=1}^{m} D_i^T W_i^{-1} (Y_i - \mu_i) = 0,
$$

where $D_i = \partial\mu_i / \partial\gamma$ is $n_i \times (k + 1)$ and $W_i = W_i(\gamma, \hat\alpha)$ is the $n_i \times n_i$ working
covariance model for unit $i$, $i = 1, \ldots, m$. The estimator $\hat\gamma$ will not be of closed
form, unless the link is linear. Under mild regularity conditions,

$$
V_\gamma^{-1/2} (\hat\gamma - \gamma) \to_d N_{k+1}(0, I_{k+1}),
$$

where $V_\gamma$ takes the sandwich form

$$
V_\gamma = \Big( \sum_{i=1}^{m} D_i^T W_i^{-1} D_i \Big)^{-1}
\Big\{ \sum_{i=1}^{m} D_i^T W_i^{-1} \text{cov}(Y_i) W_i^{-1} D_i \Big\}
\Big( \sum_{i=1}^{m} D_i^T W_i^{-1} D_i \Big)^{-1}. \qquad (9.12)
$$

In practice, an empirical estimator of $\text{cov}(Y_i)$ is substituted to obtain $\hat{V}_\gamma$. This
produces a consistent estimator of the standard error of $\hat\gamma$, so long as we have
independence between units $i \neq i'$, $i, i' = 1, \ldots, m$. For small $m$, the variance
estimator may be unstable, however.
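
To see the pieces of (9.12) in the simplest possible setting, consider an intercept-only loglinear model, $\mu_{ij} = \exp(\gamma)$, with working independence and Poisson working variance. This special case, the data, and the function name below are our own illustrative choices. Here $D_i = \mu 1_{n_i}$ and $W_i = \mu I_{n_i}$, so the estimating equation reduces to $\sum_i \sum_j (y_{ij} - \mu) = 0$, each "bread" term is $n_i\mu$, and each empirical "meat" term is the squared sum of residuals within a unit.

```python
from math import log, sqrt

def gee_poisson_intercept(clusters):
    """Intercept-only loglinear GEE with working independence.

    Returns the estimate of gamma, its sandwich standard error, and the
    model-based standard error that takes the Poisson working model as correct.
    """
    n_total = sum(len(y) for y in clusters)
    mu = sum(sum(y) for y in clusters) / n_total        # solves the estimating equation
    gamma_hat = log(mu)
    bread = n_total * mu                                # sum_i D_i^T W_i^{-1} D_i
    meat = sum((sum(y) - len(y) * mu) ** 2 for y in clusters)
    sandwich_se = sqrt(meat / bread ** 2)
    model_se = sqrt(1 / bread)
    return gamma_hat, sandwich_se, model_se

# Hypothetical counts for three units with strong within-unit similarity.
clusters = [[1, 2, 3], [10, 12, 11], [0, 1, 0]]
gamma_hat, sandwich_se, model_se = gee_poisson_intercept(clusters)
print(gamma_hat, sandwich_se, model_se)
```

With counts that cluster strongly within units, the sandwich standard error is much larger than the model-based one, although with only three units it would itself be very unstable, in line with the comment above about small $m$.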

As  in  the  linear  case,  various  assumptions  about  the  form  of  the  working 
covariance  are  available.  We  write 

Wi  =  a1/2^(«)A1/2> 
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where  Ai  =  diag[var(Yji)) . . . ,  var (Yini)]T  and  R,  is  a  working  correlation  model. 
Common  choices  include  independence,  exchangeable,  AR(1),  and  unstructured. 
For  discrete  data,  there  is  often  no  natural  choice  since,  in  this  setting,  the  correlation 
is  not  an  intuitive  measure  of  dependence. 
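As a small illustration of how a working covariance is assembled from marginal variances and a working correlation, the sketch below builds exchangeable and AR(1) correlation matrices and scales them by the standard deviations. The function names are hypothetical, and plain nested lists are used so that no external library is assumed.

```python
def exchangeable(n, rho):
    """n x n working correlation with a common off-diagonal correlation rho."""
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]


def ar1(n, rho):
    """n x n AR(1) working correlation: corr(Y_j, Y_k) = rho ** |j - k|."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]


def working_covariance(variances, corr):
    """Scale a working correlation by the marginal standard deviations."""
    sd = [v ** 0.5 for v in variances]
    n = len(sd)
    return [[sd[j] * corr[j][k] * sd[k] for k in range(n)] for j in range(n)]


print(working_covariance([4.0, 9.0], exchangeable(2, 0.5)))
```

Working independence corresponds to `exchangeable(n, 0.0)`; an unstructured choice would instead carry a separate correlation for every pair of occasions.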

For  small  m,  the  sandwich  estimator  will  have  high  variability,  and  so  model- 
based  variance  estimators  may  be  preferable  (and  we  would  probably  not  rely  on 
asymptotic  normality  if  m  were  small  anyway).  Model-based  estimators  are  more 
efficient  if  the  model  is  correct  and  efficiency  will  be  improved  if  we  can  pick  a 
working  correlation  matrix  that  is  close  to  the  true  structure. 

Published  comments  on  whether  to  assume  working  independence  or  a  more 
complex form are a little in conflict: Liang and Zeger (1986) state that there is “little
difference when correlation is moderate,” in agreement with McDonald (1993) who
states “the independence estimator may be recommended for practical purposes.”
On the other hand, Zhao et al. (1992) assert that assuming independence “can lead
to important losses of efficiency,” in line with Fitzmaurice et al. (1993) who state that
it is “important to obtain a close approximation to $\operatorname{cov}(Y_i)$ in order to achieve high
efficiency.” The issue is complex since it depends on, among other things, the design
and  whether  the  covariates  corresponding  to  the  parameters  are  constant  within  an 
individual  or  not. 


9.10  GEE2:  Connected  Estimating  Equations 

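In an approach coined by Liang et al. (1992) as GEE2, there is a connected set
of joint estimating equations for $\gamma$ and $\alpha$. This approach is particularly appealing
if the variance-covariance model is of interest. To motivate a pair of estimating
equations, consider the following model for a single individual with $n$ independent
observations:

$$Y_i \mid \gamma, \alpha \sim_{ind} N[\mu_i(\gamma), \Sigma_i(\gamma, \alpha)].$$

For example, we may have $\Sigma_i(\gamma, \alpha) = \alpha\mu_i(\gamma)^2$, $i = 1, \dots, n$. The log-likelihood is

$$l(\gamma, \alpha) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n} \log \Sigma_i - \frac{1}{2}\sum_{i=1}^{n} \frac{(Y_i - \mu_i)^2}{\Sigma_i}.$$

Differentiation gives the score equations as

$$\frac{\partial l}{\partial \gamma} = -\frac{1}{2}\sum_{i=1}^{n} \left(\frac{\partial \Sigma_i}{\partial \gamma}\right)\frac{1}{\Sigma_i} + \sum_{i=1}^{n} \left(\frac{\partial \mu_i}{\partial \gamma}\right)\frac{(Y_i - \mu_i)}{\Sigma_i} + \frac{1}{2}\sum_{i=1}^{n} \left(\frac{\partial \Sigma_i}{\partial \gamma}\right)\frac{(Y_i - \mu_i)^2}{\Sigma_i^2}$$

$$= \sum_{i=1}^{n} \left(\frac{\partial \mu_i}{\partial \gamma}\right)\frac{(Y_i - \mu_i)}{\Sigma_i} + \sum_{i=1}^{n} \left(\frac{\partial \Sigma_i}{\partial \gamma}\right)\frac{[(Y_i - \mu_i)^2 - \Sigma_i]}{2\Sigma_i^2} \qquad (9.13)$$

and

$$\frac{\partial l}{\partial \alpha} = -\frac{1}{2}\sum_{i=1}^{n} \left(\frac{\partial \Sigma_i}{\partial \alpha}\right)\frac{1}{\Sigma_i} + \frac{1}{2}\sum_{i=1}^{n} \left(\frac{\partial \Sigma_i}{\partial \alpha}\right)\frac{(Y_i - \mu_i)^2}{\Sigma_i^2}$$

$$= \sum_{i=1}^{n} \left(\frac{\partial \Sigma_i}{\partial \alpha}\right)\frac{[(Y_i - \mu_i)^2 - \Sigma_i]}{2\Sigma_i^2}. \qquad (9.14)$$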


This  pair  of  quadratic  estimating  functions  is  unbiased  given  correct  specification 
of  the  first  two  moments;  to  emphasize,  normality  of  the  data  is  not  required.  A 
disadvantage  of  the  use  of  these  functions,  compared  to  the  original  GEE  method 
(which  is  sometimes  referred  to  as  GEE1),  is  that  if  the  variance  model  is  wrong, 
we are no longer guaranteed a consistent estimator of $\gamma$. If the model is correct,
however,  there  will  be  a  gain  in  efficiency. 
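The quadratic form of the two score functions can be checked numerically. The sketch below is a hedged illustration rather than part of the original development: it assumes a scalar $\gamma$ with $\mu_i = \exp(\gamma x_i)$ and the example variance model $\Sigma_i = \alpha\mu_i^2$, and compares the analytic scores with central finite differences of the normal log-likelihood. The data values and function names are hypothetical.

```python
import math


def loglik(gamma, alpha, x, y):
    """Normal log-likelihood with mu_i = exp(gamma * x_i), Sigma_i = alpha * mu_i^2."""
    total = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(gamma * xi)
        sigma = alpha * mu ** 2
        total += -0.5 * math.log(2.0 * math.pi * sigma) - 0.5 * (yi - mu) ** 2 / sigma
    return total


def scores(gamma, alpha, x, y):
    """Analytic score functions for gamma and alpha in quadratic form."""
    s_gamma = 0.0
    s_alpha = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(gamma * xi)
        sigma = alpha * mu ** 2
        d_mu = xi * mu                 # d mu_i / d gamma
        d_sigma_gamma = 2.0 * xi * sigma   # d Sigma_i / d gamma
        d_sigma_alpha = mu ** 2        # d Sigma_i / d alpha
        quad = ((yi - mu) ** 2 - sigma) / (2.0 * sigma ** 2)
        s_gamma += d_mu * (yi - mu) / sigma + d_sigma_gamma * quad
        s_alpha += d_sigma_alpha * quad
    return s_gamma, s_alpha


def numeric_scores(gamma, alpha, x, y, h=1e-5):
    """Central finite-difference approximation to the two scores."""
    dg = (loglik(gamma + h, alpha, x, y) - loglik(gamma - h, alpha, x, y)) / (2.0 * h)
    da = (loglik(gamma, alpha + h, x, y) - loglik(gamma, alpha - h, x, y)) / (2.0 * h)
    return dg, da


x_demo = [0.0, 0.5, 1.0, 1.5]   # hypothetical covariate values
y_demo = [1.2, 1.5, 3.0, 4.1]   # hypothetical responses
print(scores(1.0, 0.1, x_demo, y_demo))
print(numeric_scores(1.0, 0.1, x_demo, y_demo))
```

The two printed pairs should agree to several decimal places, confirming that the three-term derivative collapses to the two-term quadratic form for $\gamma$ and to the single quadratic term for $\alpha$.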

Let 

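$$S_i = (Y_i - \mu_i)^2$$

with $E[S_i] = \Sigma_i$. Under normality,

$$\operatorname{var}(S_i) = E[S_i^2] - E[S_i]^2 = 3\Sigma_i^2 - \Sigma_i^2 = 2\Sigma_i^2.$$

Hence, (9.13) and (9.14) can be written

$$\frac{\partial l}{\partial \gamma} = \sum_{i=1}^{n} D_i^{T} V_i^{-1}(Y_i - \mu_i) + \sum_{i=1}^{n} E_i^{T} W_i^{-1}(S_i - \Sigma_i) \qquad (9.15)$$

$$\frac{\partial l}{\partial \alpha} = \sum_{i=1}^{n} F_i^{T} W_i^{-1}(S_i - \Sigma_i) \qquad (9.16)$$

where $D_i = \partial\mu_i/\partial\gamma$, $E_i = \partial\Sigma_i/\partial\gamma$, $F_i = \partial\Sigma_i/\partial\alpha$, $V_i = \Sigma_i$, and $W_i = 2\Sigma_i^2$.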

This  pair  of  estimating  equations  can  be  compared  with  the  usual  estimating 
equation  specification 


$$\frac{\partial l}{\partial \gamma} = \sum_{i=1}^{n} D_i^{T} V_i^{-1}(Y_i - \mu_i).$$

The additional term is the information about $\gamma$ in the variance.

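We turn to the dependent data situation and let $\mu_i$ denote the $n_i \times 1$ mean vector
and $\Sigma_i$ the $n_i \times n_i$ covariance matrix. The general form of estimating equations is

$$\sum_{i=1}^{m} \begin{bmatrix} D_i & 0 \\ E_i & F_i \end{bmatrix}^{T} \begin{bmatrix} V_i & C_i \\ C_i^{T} & W_i \end{bmatrix}^{-1} \begin{bmatrix} Y_i - \mu_i \\ S_i - \Sigma_i \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$

where $D_i = \partial\mu_i/\partial\gamma$, $E_i = \partial\Sigma_i/\partial\gamma$, and $F_i = \partial\Sigma_i/\partial\alpha$ and we have “working”
variance-covariance structure

$$V_i = \operatorname{var}(Y_i), \qquad C_i = \operatorname{cov}(Y_i, S_i), \qquad W_i = \operatorname{var}(S_i).$$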


When $C_i = 0$, we obtain

$$G_1(\gamma, \alpha) = \sum_{i=1}^{m} D_i^{T} V_i^{-1}(Y_i - \mu_i) + \sum_{i=1}^{m} E_i^{T} W_i^{-1}(S_i - \Sigma_i) \qquad (9.17)$$

$$G_2(\gamma, \alpha) = \sum_{i=1}^{m} F_i^{T} W_i^{-1}(S_i - \Sigma_i) \qquad (9.18)$$

which are the dependent data version of the normal score equations we obtained
earlier, that is, (9.15) and (9.16). In the dependent data pair of equations, we have
freedom in choosing $V_i$ and $W_i$. In particular, the latter need not be chosen to
coincide with that under a multivariate normal model, and, since this choice is
difficult, we could instead choose working independence.

It  can  be  shown  (Prentice  and  Zhao  1991,  Appendix  2)  that  (9.17)  and  (9.18) 
arise  from  the  quadratic  exponential  model 

$$p(y_i \mid \theta_i, \lambda_i) = \Delta_i^{-1} \exp\left[y_i^{T}\theta_i + w_i^{T}\lambda_i + c_i(y_i)\right], \qquad (9.19)$$

where $\theta_i = [\theta_{i1}, \dots, \theta_{in_i}]^{T}$ is the canonical parameter,

$$w_i = [y_{i1}^2, y_{i1}y_{i2}, \dots, y_{i2}^2, y_{i2}y_{i3}, \dots]^{T}$$

is the vector of squared responses and cross-products, $c_i(\cdot)$ is a function that defines the
“shape,” $\Delta_i = \Delta_i(\theta_i, \lambda_i, c_i)$ is a normalization constant, and
$\lambda_i = [\lambda_{i11}, \lambda_{i12}, \dots, \lambda_{i22}, \lambda_{i23}, \dots]^{T}$.
As an example of this form, if all the responses are continuous on the whole real line
and $c_i = 0$, the multivariate normal is recovered (Exercise 9.2). Gourieroux et al.
(1984)  showed  that  the  quadratic  exponential  family  is  unique  in  giving  consistent 
estimates  of  the  mean  and  covariance  parameters,  even  in  the  situation  in  which  the 
data  actually  arise  from  outside  this  family.  So,  as  the  exponential  family  produces 
desirable  consistency  properties  for  mean  parameters,  the  quadratic  exponential 
family  has  the  same  properties  when  mean  and  variance  parameters  are  of  interest. 

To emphasize: For consistency of $\hat{\gamma}$, we require the models for both $Y_i$ and $S_i$
to be correct, and there is increased efficiency over the single estimating equation
version (GEE1) if this is the case. This approach is useful if the variance-covariance
parameters are of primary interest as, for example, in some breeding and genetic
applications. Otherwise, it can be prudent to stick with GEE1.


9.11 Interpretation of Marginal and Conditional Regression Coefficients

To illustrate the differences in interpretation of marginal and conditional coefficients,
we examine the meaning of parameters for a loglinear model. In a marginal
model, such as is considered under GEE, we have

$$E[Y \mid x] = \exp(\gamma_0 + \gamma_1 x),$$

in which case $\exp(\gamma_1)$ is the multiplicative change in the average response over two
populations of individuals whose $x$ values differ by one unit. Under the conditional
mixed model, the interpretation of regression coefficients is conditional on the value
of the random effect. For the model

$$E[Y \mid x, b] = \exp(\beta_0 + \beta_1 x + b),$$

with $b \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$, $\exp(\beta_1)$ is therefore the multiplicative change in the expected
response for two individuals with identical random effects whose $x$ values differ by one unit.
Sometimes, the comparison is described as between two typical (i.e., $b = 0$) individuals who
differ in $x$ by one unit. The marginal mean corresponding to this model follows from the mean of
a lognormal distribution:

$$E[Y \mid x] = E_b[E(Y \mid x, b)] = \exp(\beta_0 + \sigma_0^2/2 + \beta_1 x).$$

Therefore, for the random intercepts loglinear model, $\exp(\beta_1)$ has the same
marginal interpretation as $\exp(\gamma_1)$, and the marginal intercept is $\gamma_0 = \beta_0 + \sigma_0^2/2$.
We  now  consider  the  random  intercepts  and  slopes  model 

$$E[Y \mid x, b] = \exp[(\beta_0 + b_0) + (\beta_1 + b_1)x],$$

where $b = [b_0, b_1]^{T}$ and

$$\begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \sim N_2\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} D_{00} & D_{01} \\ D_{10} & D_{11} \end{bmatrix}\right).$$

In this model $\exp(\beta_1)$ is the relative risk between two individuals with the same $b$
but with $x$ values that differ by one unit. That is,

$$\exp(\beta_1) = \frac{E[Y \mid x, b]}{E[Y \mid x - 1, b]}.$$

An alternative interpretation is to say that it is the expected change between two
“typical individuals,” that is, individuals with specific values of the random effects,
$b = 0$. Under this model, the marginal mean is

$$E[Y \mid x] = \exp\left[\beta_0 + D_{00}/2 + x(\beta_1 + D_{01}) + x^2 D_{11}/2\right],$$

so that a quadratic loglinear marginal model has been induced by the conditional
formulation. The marginal median is $\exp(\beta_0 + \beta_1 x)$ so that $\exp(\beta_1)$ is the ratio of
median responses between two populations whose $x$ values differ by one unit. There
is no such simple interpretation in terms of marginal means.
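The two marginal means above follow from $E[\exp(u)] = \exp(\operatorname{var}(u)/2)$ for a zero-mean normal $u$, applied to $u = b_0 + b_1 x$. A minimal sketch, with a hypothetical function name and illustrative parameter values, shows that the marginal rate ratio is constant under random intercepts but depends on $x$ once random slopes are present.

```python
import math


def marginal_mean(x, beta0, beta1, d00, d01=0.0, d11=0.0):
    """E[Y | x] when E[Y | x, b] = exp((beta0 + b0) + (beta1 + b1) * x), b ~ N(0, D)."""
    var_lin = d00 + 2.0 * x * d01 + x * x * d11   # var(b0 + b1 * x)
    return math.exp(beta0 + beta1 * x + var_lin / 2.0)


# Random intercepts only: the ratio for a unit change in x is exp(beta1) at every x.
ratio_int = marginal_mean(2.0, 0.5, 0.3, 0.4) / marginal_mean(1.0, 0.5, 0.3, 0.4)

# Random intercepts and slopes: the ratio changes with x.
ratio_at_1 = marginal_mean(1.0, 0.5, 0.3, 0.4, 0.1, 0.2) / marginal_mean(0.0, 0.5, 0.3, 0.4, 0.1, 0.2)
ratio_at_2 = marginal_mean(2.0, 0.5, 0.3, 0.4, 0.1, 0.2) / marginal_mean(1.0, 0.5, 0.3, 0.4, 0.1, 0.2)
print(ratio_int, math.exp(0.3), ratio_at_1, ratio_at_2)
```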

Hence,  marginal  inference  is  possible  under  a  mixed  model  formulation,  though 
care  must  be  taken  to  derive  the  exact  form  of  the  marginal  model.  Estimation 
of  marginal  parameters  via  GEE  produces  a  consistent  estimator  in  more  general 
circumstances  than  mixed  model  estimation,  though  there  is  an  efficiency  loss  if  the 
random  effects  model  is  correct. 


Example:  Seizure  Data 

The  marginal  mean  version  of  the  conditional  model  fitted  previously  in  this 
chapter  is 

$$E[Y_{ij}] = T_{ij} \exp(\gamma_0 + \gamma_1 x_{i1} + \gamma_2 x_{ij2} + \gamma_3 x_{i1} x_{ij2}).$$

The parameters are interpreted as follows:

• $\exp(\gamma_0)$ is the expected rate of seizures in the placebo group during the baseline
period, $j = 0$ (this expectation is over the hypothetical population of individuals
who were assigned to the placebo group).

• $\exp(\gamma_1)$ is the ratio of the expected seizure rate in the progabide group, compared
to the placebo group, during the baseline period.

• $\exp(\gamma_2)$ is the ratio of the expected seizure rate post-baseline as compared to
baseline, in the placebo group.

• $\exp(\gamma_3)$ is the ratio of the expected seizure rates in the progabide group in the
post-baseline period, as compared to the placebo group, in the same period.
Hence, $\exp(\gamma_3)$ is a period by treatment effect and is the parameter of interest.

The loglinear mean model suggests the variance model $\operatorname{var}(Y_{ij}) = \alpha_1 \mu_{ij}$. We
consider various forms for the working correlation. Table 9.7 gives estimates and
standard errors under various models. The Poisson, quasi-likelihood, and working
independence GEE models have estimating equation

$$G(\gamma, \alpha) = \sum_{i=1}^{m} x_i^{T}(Y_i - \mu_i) = 0.$$

Consequently,  the  point  estimates  coincide  but  the  models  differ  in  the  manner 
by  which  the  standard  errors  are  calculated.  The  Poisson  standard  errors  are 
clearly  much  too  small.  The  coincidence  of  the  estimates  and  standard  errors 
for  independence  and  exchangeable  working  correlations  is  a  consequence  of  the 
balanced design. The quasi-likelihood standard errors are increased by $\sqrt{19.7} = 4.4$
(in line with the empirical estimates in Table 9.2) but do not acknowledge
dependence of observations on the same individual (so estimation is carried out
as if we have $59 \times 5$ independent observations). The standard errors of estimated
parameters that are associated with time-varying covariates ($\gamma_2$ and $\gamma_3$) are reduced
under GEE, since within-person comparisons are being made and a longitudinal
design can be very efficient in such a study, if there is strong within-individual
dependence (as discussed in Sect. 8.3). In none of the analyses would the treatment
effect of interest be judged significantly different from zero, under conventional
levels.

Table 9.7 Parameter estimates and standard errors under various models for the seizure data

                         Poisson          Quasi-Lhd        GEE independence   GEE exchangeable   GEE AR(1)
                         Est.    SE       Est.    SE       Est.    SE         Est.    SE         Est.    SE
$\gamma_0$               1.35    0.034    1.35    0.15     1.35    0.16       1.35    0.16       1.31    0.16
$\gamma_1$               0.027   0.047    0.027   0.21     0.027   0.22       0.027   0.22       0.015   0.21
$\gamma_2$               0.11    0.047    0.11    0.21     0.11    0.12       0.11    0.12       0.16    0.11
$\gamma_3$              -0.10    0.065   -0.10    0.29    -0.10    0.22      -0.10    0.22      -0.13    0.27
$\alpha_1$, $\alpha_2$   1.0     0        19.7    0        19.4    0          19.4    0.78       20.0    0.89

Parameter meaning: $\gamma_0$ is the log baseline seizure rate in the placebo group; $\gamma_1$ is the log of the
ratio of seizure rates between the progabide and placebo groups, at baseline; $\gamma_2$ is the log of the
ratio of seizure rates in the post-baseline and baseline placebo groups; $\gamma_3$ is the log of the ratio of
the seizure rate in the progabide group as compared to the placebo group, post-baseline; $\alpha_1$ and
$\alpha_2$ are variance and correlation parameters, respectively (the final row lists $\alpha_1$ and then $\alpha_2$
under each model)
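The relationship between the Poisson and quasi-likelihood columns of Table 9.7 can be reproduced directly: the quasi-likelihood standard errors are the Poisson standard errors multiplied by the square root of the estimated dispersion. The sketch below uses the rounded values from the table; the dictionary keys and function name are hypothetical labels.

```python
import math

# Poisson standard errors and quasi-likelihood dispersion estimate from Table 9.7.
poisson_se = {"gamma0": 0.034, "gamma1": 0.047, "gamma2": 0.047, "gamma3": 0.065}
alpha1_hat = 19.7


def quasi_likelihood_se(se_by_name, dispersion):
    """Inflate model-based Poisson standard errors by sqrt(dispersion)."""
    scale = math.sqrt(dispersion)
    return {name: se * scale for name, se in se_by_name.items()}


ql_se = quasi_likelihood_se(poisson_se, alpha1_hat)
print({name: round(se, 2) for name, se in ql_se.items()})
```

This rescaling changes every standard error by the same factor, which is why it cannot reproduce the GEE pattern in which only the standard errors for the time-varying covariates shrink.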


9.12  Introduction  to  Modeling  Dependent  Binary  Data 

Binary  outcomes  are  the  simplest  form  of  data  but  are,  ironically,  one  of  the  most 
challenging to model. For a single binary variable $Y$ all moments are determined
by $p = E[Y]$. Specifically, $E[Y^r] = p$ for $r \geq 1$, so that Bernoulli random
variables cannot be overdispersed. Before turning to observations on multiple units,
we initially adopt a simplified notation and consider $n$ binary observations
$Y = [Y_1, \dots, Y_n]^{T}$. Under conditional independence and with probabilities $p_j = E[Y_j]$,

$$\Pr(Y = y \mid p) = \prod_{j=1}^{n} p_j^{y_j}(1 - p_j)^{1 - y_j},$$

with $p = [p_1, \dots, p_n]^{T}$. In Chap. 7, we saw that a common mean form is the logistic
regression model with $\log[p_j/(1 - p_j)] = x_j\beta$. In this chapter we wish to formulate
models that allow for dependence between binary outcomes, with a starting point
being the specification of a multivariate binary distribution. Such a joint distribution
can be used with a likelihood-based approach, or one can use the first one or two
moments only within a GEE approach. The difficulty with multivariate binary data
is that there is no natural way to characterize dependence between pairs, triples,
etc., of binary responses. In the dependent binary data situation, we will show that
correlation parameters are tied to the means, making estimation from a model based
on means and correlations unattractive.

To specify the joint distribution of $n$ binary responses requires $2^n$ probabilities
so that the saturated model has $2^n - 1$ parameters. This may be contrasted
with a saturated multivariate normal model which has $n$ means, $n$ variances, and
$n(n - 1)/2$ correlations. As $n$ becomes large, the number of parameters in the binary
saturated model is very large. With $n = 10$, for example, there are $2^{10} - 1 = 1{,}023$
parameters in the binary model as compared to 65 in the normal model. Our aim
is to reduce the $2^n - 1$ distinct probabilities to give formulations that allow both a
parsimonious  description  and  the  interpretable  specification  of  a  regression  model. 
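The parameter counts quoted above are easy to tabulate. The following short sketch, with hypothetical function names, contrasts the growth of the two saturated models.

```python
def saturated_binary_params(n):
    """Free probabilities in a saturated model for n binary responses."""
    return 2 ** n - 1


def saturated_normal_params(n):
    """n means, n variances and n(n - 1)/2 correlations."""
    return n + n + n * (n - 1) // 2


for n in (2, 5, 10, 20):
    print(n, saturated_binary_params(n), saturated_normal_params(n))
```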

We  begin  our  description  of  models  for  multivariate  binary  data  in  Sect.  9.13  with 
a  discussion  of  mixed  models,  with  likelihood,  Bayesian  and  conditional  likelihood 
approaches  to  inference.  Next,  in  Sect.  9.14,  marginal  models  are  described. 

9.13  Mixed  Models  for  Binary  Data 

9.13.1  Generalized  Linear  Mixed  Models  for  Binary  Data 

In  Sect.  7.5,  we  discussed  a  beta-binomial  model  for  overdispersed  data.  This  form 
is  not  very  flexible,  for  the  reasons  described  in  Sect.  9.8,  and  so  we  describe  an 
alternative mixed model with normal random effects. Let $Y_{ij}$ be the binary “success”
indicator with $j = 1, \dots, n_i$ trials on each of $i = 1, \dots, m$ units.

Consider the GLMM with logistic link:

Stage One: Likelihood: $Y_{ij} \mid p_{ij} \sim_{ind} \text{Bernoulli}(p_{ij})$ with the linear logistic
model

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = x_{ij}\beta + z_{ij}b_i.$$

In this model, $\beta$ represents a $(k + 1) \times 1$ vector of fixed effects and $b_i$ a $(q + 1) \times 1$
vector of random effects, with $q \leq k$. Let $x_{ij} = [1, x_{ij1}, \dots, x_{ijk}]$ be a $(k + 1) \times 1$
vector of covariates, so that $x_i = [x_{i1}, \dots, x_{in_i}]^{T}$ is the design matrix for the fixed
effects, and let $z_{ij} = [1, z_{ij1}, \dots, z_{ijq}]^{T}$ be a $(q + 1) \times 1$ vector of variables that
are a subset of $x_{ij}$, so that $z_i = [z_{i1}, \dots, z_{in_i}]^{T}$ is the design matrix for the random
effects.

Stage Two: Random effects distribution: $b_i \mid D \sim_{iid} N_{q+1}(0, D)$ for $i = 1, \dots, m$.

As we have repeatedly stressed, the conditional parameters $\beta$ and the marginal
parameters $\gamma$ have different interpretations in nonlinear situations, and for a logistic
model, there is no exact analytical relationship between the two. However, we
may approximate the relationship. For the random intercepts model
$b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$, we have, for a generic Bernoulli response $Y$ with associated random
effect $b$,

$$E[Y] = \frac{\exp(x\gamma)}{1 + \exp(x\gamma)} = E_b[E(Y \mid b)] = E_b\left[\frac{\exp(x\beta + b)}{1 + \exp(x\beta + b)}\right] \approx \frac{\exp\left(x\beta/[c^2\sigma_0^2 + 1]^{1/2}\right)}{1 + \exp\left(x\beta/[c^2\sigma_0^2 + 1]^{1/2}\right)}, \qquad (9.20)$$

where $c = 16\sqrt{3}/(15\pi)$ (Exercise 9.1), so that

$$\gamma \approx \frac{\beta}{[c^2\sigma_0^2 + 1]^{1/2}}. \qquad (9.21)$$

Consequently, the marginal coefficients are attenuated toward zero. Figure 9.6
illustrates this phenomenon for particular values of $\beta_0$, $\beta_1$, $\sigma_0^2$. We observe that the
averaging of the conditional curves results in a flattened marginal curve. This
attenuation was first encountered in Sect. 7.9 when the lack of collapsibility of the
odds ratio was discussed. We emphasize that one should not view the difference
in marginal and conditional parameter estimates as bias. If $\sigma_0 > 0$ and $\beta_1 \neq 0$,
the parameters will differ, but they are estimating different quantities. In practice,
if we fit marginal and conditional models and we do not see attenuation, then the
approximation could be poor (e.g., if $\sigma_0^2$ is large) or some of the assumptions of the
conditional model could be inaccurate.

Fig. 9.6 Individual-level curves (dotted lines) from a random intercept logistic GLMM, along with
marginal curve (solid line). The specific model is $\operatorname{logit}(E[Y \mid b]) = \beta_0 + \beta_1 x + b$, with $\beta_0 = 0$, $\beta_1 = 1$
and $b \sim_{iid} N(0, 2^2)$. The approximate attenuation factor of the marginal curve, which is given by
the denominator of (9.21), is 1.54

For  the  general  logistic  mixed  model 


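$$\log\left(\frac{E[Y \mid b]}{1 - E[Y \mid b]}\right) = x\beta + zb$$

with $b \mid D \sim_{iid} N_{q+1}(0, D)$, we obtain

$$E[Y] = \frac{\exp(x\gamma)}{1 + \exp(x\gamma)} \approx \frac{\exp\left(x\beta/\left|c^2 D z z^{T} + I_{q+1}\right|^{(q+1)/2}\right)}{1 + \exp\left(x\beta/\left|c^2 D z z^{T} + I_{q+1}\right|^{(q+1)/2}\right)},$$

so that

$$\gamma \approx \frac{\beta}{\left|c^2 D z z^{T} + I_{q+1}\right|^{(q+1)/2}}.$$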

With  random  slopes  or  more  complicated  random  effects  structures,  it  is  therefore 
far  more  difficult  to  understand  the  relationship  between  conditional  and  marginal 
parameters. 

Marginal inference is possible with mixed models, but one needs to do a little
work. Specifically, if one requires marginal inference, then the above approximations
may be invoked, or one may directly calculate the required integrals using a
Monte Carlo estimate. For example, the marginal probability at $x$ is

$$\hat{E}[Y \mid x] = \frac{1}{S}\sum_{s=1}^{S} \frac{\exp(x\hat{\beta} + b^{(s)})}{1 + \exp(x\hat{\beta} + b^{(s)})}, \qquad (9.22)$$

where the random effects are simulated as $b^{(s)} \mid \hat{D} \sim N_{q+1}(0, \hat{D})$, $s = 1, \dots, S$.
A more refined Bayesian approach would replace $\hat{D}$ by samples from the posterior
$p(D \mid y)$.
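The Monte Carlo estimate and the closed-form attenuation approximation can be compared in a few lines. The sketch below assumes a random intercepts model with a scalar linear predictor and uses the values quoted in the caption of Fig. 9.6 ($\beta_0 = 0$, $\beta_1 = 1$, $\sigma_0 = 2$); the function names, the number of draws, and the seed are illustrative choices, not taken from the text.

```python
import math
import random

C = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)


def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-u))


def attenuation(sigma0):
    """Approximate factor by which conditional coefficients are divided."""
    return math.sqrt(C ** 2 * sigma0 ** 2 + 1.0)


def marginal_prob_approx(lin_pred, sigma0):
    """Approximate marginal probability from the attenuated linear predictor."""
    return expit(lin_pred / attenuation(sigma0))


def marginal_prob_mc(lin_pred, sigma0, n_draws=20000, seed=1):
    """Monte Carlo average of expit(lin_pred + b) over b ~ N(0, sigma0^2)."""
    rng = random.Random(seed)
    total = sum(expit(lin_pred + rng.gauss(0.0, sigma0)) for _ in range(n_draws))
    return total / n_draws


beta0, beta1, sigma0 = 0.0, 1.0, 2.0
for x in (-2.0, 0.0, 1.0, 2.0):
    eta = beta0 + beta1 * x
    print(x, marginal_prob_approx(eta, sigma0), marginal_prob_mc(eta, sigma0))
```

For a model with random slopes, the scalar draw would be replaced by a draw of the full random effects vector combined with the relevant covariates, which is not shown here.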

An  important  distinction  between  conditional  and  marginal  modeling  through 
GEE  is  that  the  latter  is  likely  to  be  more  robust  to  model  misspecification,  since  it 
directly  models  marginal  associations. 

Recall that the logistic regression model for binary data can be derived by consideration
of an unobserved (latent) continuous logistic random variable (Sect. 7.6.1).
This latent formulation can be extended to the mixed model. In particular, assume
that $b_i \mid \sigma_0^2 \sim N(0, \sigma_0^2)$ and that, given $b_i$, the latent variable $U_{ij}$ follows a logistic
distribution with location $\mu_{ij} + b_i$ and unit scale, that is,
$U_{ij} \mid b_i \sim_{ind} \text{Logistic}(\mu_{ij} + b_i, 1)$. Without loss of generality,
set $Y_{ij} = 1$ if $U_{ij} > c$ and 0 otherwise. Then

$$\Pr(Y_{ij} = 1 \mid b_i) = \Pr(U_{ij} > c \mid b_i) = \frac{\exp(\mu_{ij} + b_i - c)}{1 + \exp(\mu_{ij} + b_i - c)},$$

and taking $\mu_{ij} = x_{ij}\beta + c$ produces the random effects logistic model.

An interpretation of $\sigma_0^2$ is obtained by comparing its magnitude to $\pi^2/3$ (the
variance of the logistic distribution, which can be viewed as the within-person
variability) via the intra-class correlation:

$$\rho = \operatorname{corr}(U_{ij}, U_{ik}) = \frac{\sigma_0^2}{\sigma_0^2 + \pi^2/3}.$$

Note that $\rho$ is the marginal correlation (averaged over the random effects) among the
unobserved latent variables $U_{ij}$ and not the marginal correlation among the $Y_{ij}$'s.
See Fitzmaurice et al. (2004, Sect. 12.5) for further discussion.

We examine the marginal moments further. The marginal mean is
$E[Y_{ij}] = \Pr(Y_{ij} = 1) = E_{b_i}[p_{ij}]$, where we continue to consider the random intercepts only
model

$$p_{ij} = \frac{\exp(x_{ij}\beta + b_i)}{1 + \exp(x_{ij}\beta + b_i)}.$$

The expectation is over the distribution of the random effect. We have already
derived the approximate marginal mean (9.20), which we write as

$$E[Y_{ij}] \approx \frac{\exp\left[x_{ij}\beta/(c^2\sigma_0^2 + 1)^{1/2}\right]}{1 + \exp\left[x_{ij}\beta/(c^2\sigma_0^2 + 1)^{1/2}\right]}.$$

The variance is

$$\operatorname{var}(Y_{ij}) = E[\operatorname{var}(Y_{ij} \mid b_i)] + \operatorname{var}[E(Y_{ij} \mid b_i)] = E[p_{ij} - p_{ij}^2] + E[p_{ij}^2] - E[p_{ij}]^2 = E[p_{ij}](1 - E[p_{ij}]),$$

illustrating again that there is no overdispersion for a Bernoulli random variable.
This gives the diagonal elements of the marginal variance-covariance matrix. The
covariances between responses on the same unit $i$ are

$$\operatorname{cov}(Y_{ij}, Y_{ik}) = \operatorname{cov}\left(\frac{\exp(x_{ij}\beta + b_i)}{1 + \exp(x_{ij}\beta + b_i)}, \frac{\exp(x_{ik}\beta + b_i)}{1 + \exp(x_{ik}\beta + b_i)}\right)$$

$$= E\left[\left(\frac{\exp(x_{ij}\beta + b_i)}{1 + \exp(x_{ij}\beta + b_i)}\right)\left(\frac{\exp(x_{ik}\beta + b_i)}{1 + \exp(x_{ik}\beta + b_i)}\right)\right] - E[p_{ij}]E[p_{ik}],$$

so note that the marginal covariance is not constant and not of easily interpretable
form. With a single random effect, the correlations are all determined by the single
parameter $\sigma_0$.

9.13.2 Likelihood Inference for the Binary Mixed Model

As with the GLMMs described in Sect. 9.4, the integrals required to evaluate the
likelihood for the fixed effects $\beta$ and variance components $\alpha = D$ are analytically
intractable. Unfortunately, the Laplace approximation method may not be reliable
for binary GLMMs, particularly if the random effects variances are large. For this
reason, adaptive Gauss-Hermite quadrature methods are often resorted to, though
care in implementation is required to ensure that sufficient points are used to obtain
an accurate approximation. When maximization routines encounter convergence
problems, it may be an indication that either the model being fitted is not supported
by the data or that the data do not contain sufficient information to estimate all of the
parameters.


9.13.3  Bayesian  Inference  for  the  Binary  Mixed  Model 

A Bayesian approach to binary GLMMs requires priors to be specified for $\beta$ and
$D$. As in Sect. 9.6.2, the priors may be specified in terms of interpretable quantities,
for  example,  the  residual  odds  of  success.  The  information  in  binary  data  is  limited, 
and  so  sensitivity  to  the  priors  may  be  encountered,  particularly  the  prior  on  D. 
As  with  likelihood-based  approaches,  greater  care  is  required  in  computation  with 
binary  data.  Fong  et  al.  (2010)  report  that  the  INLA  method  is  relatively  inaccurate 
for  binary  GLMMs  so  that  MCMC  is  the  more  reliable  method  if  the  binomial 
denominators  are  small. 


Example:  Contraception  Data 

We  illustrate  likelihood  inference  for  a  binary  GLMM  using  the  contraception  data 
introduced in Sect. 9.2.1. Let $Y_{ij} = 0/1$ denote the absence/presence of amenorrhea
in the $i$th woman at time $t_{ij}$, where the latter takes the values 1, 2, 3, or 4. Also,
let $d_i = 0/1$ represent the randomization indicators to doses of 100 mg/150 mg,
for $i = 1, \dots, 1151$ women (576 and 575 women received the low and high doses,
respectively). There are $n_i$ observations per woman, up to a maximum of 4. We
consider the following two-stage model:

Stage One: The response model is $Y_{ij} \mid p_{ij} \sim_{ind} \text{Bernoulli}(p_{ij})$ with

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_0 + \beta_1 t_{ij} + \beta_2 t_{ij}^2 + \beta_3 d_i t_{ij} + \beta_4 d_i t_{ij}^2 + b_i, \qquad (9.23)$$

so that we have separate quadratic models in time for each of the two dose levels.

Stage Two: The random effects model is $b_i \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$.

We do not include a term for the main effect of dose, since we assume that
randomization has ensured that the two dose groups are balanced at baseline
($t = 0$). The conditional odds ratios $\exp(\beta_1)$ and $\exp(\beta_2)$ represent linear and
quadratic terms in time for a typical individual ($b_i = 0$) in the low-dose group.
Similarly, $\exp(\beta_1 + \beta_3)$ and $\exp(\beta_2 + \beta_4)$ represent linear and quadratic terms in
time for a typical individual ($b_i = 0$) in the high-dose group.

Table 9.8 Mixed effects model parameter estimates for the contraception data

                     Likelihood Laplace     Likelihood G-H (a)     Bayesian MCMC
Parameter            Est.      Std. err.    Est.      Std. err.    Est.      Std. err.
Intercept            -3.8      0.27         -3.8      0.30         -3.6      0.27
Low-dose time         1.1      0.25          1.1      0.27          0.99     0.25
Low-dose time^2      -0.044    0.052        -0.042    0.055        -0.015    0.052
High-dose time        0.55     0.18          0.56     0.21          0.55     0.18
High-dose time^2     -0.11     0.051        -0.11     0.050        -0.11     0.058
$\sigma_0$            2.1      -             2.3      0.11          2.2      0.13

(a) Adaptive Gauss-Hermite with 50 points

Table  9.8  gives  parameter  estimates  and  standard  errors  for  a  number  of  analyses, 
including  Laplace  and  adaptive  Gauss-Hermite  rules  for  likelihood  calculation.  We 
initially  concentrate  on  the  Gauss-Hermite  results  which  are  more  reliable  than 
those  based  on  the  Laplace  implementation.  Informally,  comparing  the  estimates 
with  the  standard  errors,  the  linear  terms  in  time  are  clearly  needed,  while  it  is  not 
so  obvious  that  the  quadratic  terms  are  required. 

In  terms  of  substantive  conclusions,  a  woman  assigned  the  high  dose,  when 
compared  to  a  woman  assigned  the  low  dose,  both  with  the  same  baseline  risk  of 
amenorrhea  (i.e.,  with  the  same  random  effect)  will  have  increased  odds  at  time  t  of 

$$\exp(\beta_3 t + \beta_4 t^2),$$

giving  increases  of  1.6,  2.0,  2.0,  1.6  at  times  1,  2,  3,  4,  respectively.  Hence,  the 
difference  between  the  groups  increases  and  then  decreases  as  a  function  of  time, 
though  it  is  always  greater  than  zero. 
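The quoted odds ratios can be reproduced from the rounded Gauss-Hermite estimates in Table 9.8, taking the "High-dose time" and "High-dose time squared" rows as $\hat{\beta}_3 = 0.56$ and $\hat{\beta}_4 = -0.11$. The function name below is a hypothetical label for this calculation.

```python
import math


def dose_odds_ratio(t, beta3=0.56, beta4=-0.11):
    """Conditional odds ratio, high versus low dose, at time t for a common b_i."""
    return math.exp(beta3 * t + beta4 * t ** 2)


for t in (1, 2, 3, 4):
    print(t, round(dose_odds_ratio(t), 1))
```

On the log odds scale the difference $\beta_3 t + \beta_4 t^2$ is positive at all four occasions, which corresponds to odds ratios above one throughout.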

The standard deviation of the random effects $\hat{\sigma}_0 = 2.3$ is substantial here. An
estimate of a 95% interval for the risk of amenorrhea in the low-dose group at
occasion 1 is

$$\frac{\exp(-3.8 + 1.1 - 0.042 \pm 1.96 \times 2.3)}{1 + \exp(-3.8 + 1.1 - 0.042 \pm 1.96 \times 2.3)} = [0.0007, 0.85],$$

so that we have very large between-woman variability in risk. The marginal intra-class
correlation coefficient is estimated as $\hat{\rho} = 0.61$ (recall this is the correlation
for the latent variable and not for the marginal responses).

Fig. 9.7 Probability of amenorrhea over time in low- and high-dose groups in the contraception
data, along with fitted probabilities. The latter are calculated via Monte Carlo simulation, with
likelihood estimation in the mixed model, implemented with Gauss-Hermite quadrature
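The interval for the risk at occasion 1 and the latent intra-class correlation can both be recomputed from the rounded Gauss-Hermite estimates. This is a sketch under that rounding assumption: with $\hat{\sigma}_0$ entered as 2.3 the intra-class correlation comes out at about 0.62, close to the reported 0.61, which presumably uses the unrounded estimate. The function names are hypothetical.

```python
import math


def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-u))


def risk_interval(lin_pred, sigma0, z=1.96):
    """Range of conditional risks covering the central 95% of random intercepts."""
    return expit(lin_pred - z * sigma0), expit(lin_pred + z * sigma0)


def latent_icc(sigma0):
    """Intra-class correlation of the latent logistic variables."""
    return sigma0 ** 2 / (sigma0 ** 2 + math.pi ** 2 / 3.0)


low, high = risk_interval(-3.8 + 1.1 - 0.042, 2.3)
print(low, high, latent_icc(2.3))
```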


Table 9.9 Monte Carlo estimated variances (on the diagonal) and correlations (upper diagonal),
between measurements on the same woman, at different observation occasions (1 to 4), in the low-
(left) and high- (right) dose groups

        Low dose                         High dose
        1      2      3      4           1      2      3      4
1       0.14   0.38   0.36   0.33        0.17   0.39   0.36   0.33
2              0.20   0.41   0.39               0.23   0.42   0.40
3                     0.24   0.43                      0.25   0.43
4                            0.25                             0.24

These estimates are based on likelihood estimation in the mixed model, implemented with
Gauss-Hermite quadrature


Allowing  the  random  effects  variance  to  vary  by  covariate  groups  is  important 
to  investigate  since  missing  such  dependence  can  lead  to  serious  inaccuracies 
(Heagerty and Kurland 2001). The assumption of a common $\sigma_0$ in the two groups is
important for accurate inference in this example. We fit separate logistic GLMMs to
the two dose groups and obtain estimates of 2.3 and 2.2, illustrating that a common
$\sigma_0$ is supported by the data.

We  evaluate  the  marginal  means  calculation  using  Monte  Carlo  integration. 
These  means  are  shown,  along  with  the  observed  proportions,  in  Fig.  9.7.  We  see 
that  the  overall  fit  is  good,  apart  from  the  last  time  point  (for  which  there  is  reduced 
data  due  to  dropout). 
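The Monte Carlo integration extends from marginal means to the marginal variances and correlations of two responses that share a random intercept. The sketch below assumes the rounded Gauss-Hermite estimates from Table 9.8 for the low-dose group at occasions 1 and 2; the function name, number of draws, and seed are illustrative choices, so the output should only be expected to agree roughly with the published table entries.

```python
import math
import random


def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-u))


def marginal_moments(eta_j, eta_k, sigma0, n_draws=50000, seed=2):
    """Monte Carlo marginal means, variances and correlation for two occasions.

    eta_j, eta_k: fixed-effect linear predictors at the two occasions.
    The same random intercept b ~ N(0, sigma0^2) is shared by both responses.
    """
    rng = random.Random(seed)
    sum_j = sum_k = sum_jk = 0.0
    for _ in range(n_draws):
        b = rng.gauss(0.0, sigma0)
        p_j = expit(eta_j + b)
        p_k = expit(eta_k + b)
        sum_j += p_j
        sum_k += p_k
        sum_jk += p_j * p_k
    mean_j = sum_j / n_draws
    mean_k = sum_k / n_draws
    var_j = mean_j * (1.0 - mean_j)          # Bernoulli marginal variance
    var_k = mean_k * (1.0 - mean_k)
    cov_jk = sum_jk / n_draws - mean_j * mean_k
    return mean_j, mean_k, var_j, var_k, cov_jk / math.sqrt(var_j * var_k)


eta1 = -3.8 + 1.1 * 1 - 0.042 * 1 ** 2   # low dose, occasion 1
eta2 = -3.8 + 1.1 * 2 - 0.042 * 2 ** 2   # low dose, occasion 2
print(marginal_moments(eta1, eta2, 2.3))
```

Because the covariance depends on both linear predictors, repeating the calculation for other pairs of occasions gives correlations that vary over time even though there is a single variance parameter.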

In Table 9.9, we estimate the marginal variance-covariance and correlation
matrices for the two dose groups using Monte Carlo integration. As we have
already discussed in Sect. 9.13.1, a random intercepts only model does not lead to
correlations that are constant across time (unlike the linear model). In general, the
estimates are in reasonable agreement with the empirical variances and correlations
reported in Table 9.1.

For the Bayesian analysis, the prior for the intercept was relatively flat,
$\beta_0 \sim N(0, 2.38^2)$. If there was no effect of time (i.e., if $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$) then
a 95% interval for the probabilities for a typical individual would be
$\exp(\pm 1.96 \times 2.38)/[1 + \exp(\pm 1.96 \times 2.38)] = [0.009, 0.99]$. For the regression
coefficients, we specify $\beta_k \sim N(0, 0.98^2)$,
which gives a 95% interval for the odds ratios of $\exp(\pm 1.96 \times 0.98) = [0.15, 6.8]$.
Finally,  for  <Jq2,  we  assume  a  Gamma(0.5,0.1)  prior  which  gives  a  95%  interval 
for  (To  of  [0.06,4.5].  More  informatively,  a  95%  interval  for  the  residual  odds 
is  [0.17,6.0].  These  priors  are  not  uninformative  but  correspond  to  ranges  for 
probabilities  and  odds  ratios  that  are  consistent  with  the  application. 
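
The prior ranges quoted for the intercept and the regression coefficients can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the interval for the probabilities is obtained by applying the inverse logit to the end points of the normal interval.

```python
import math

def expit(x):
    """Inverse logit: maps a log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_interval(sd, z=1.96):
    """Central 95% interval of a N(0, sd^2) prior (hypothetical helper)."""
    return (-z * sd, z * sd)

# Intercept prior N(0, 2.38^2): implied range for a typical probability.
lo, hi = normal_interval(2.38)
prob_interval = (expit(lo), expit(hi))

# Coefficient prior N(0, 0.98^2): implied range for an odds ratio.
lo, hi = normal_interval(0.98)
odds_ratio_interval = (math.exp(lo), math.exp(hi))

print(prob_interval, odds_ratio_interval)
```

Working backwards in the same way (choosing the standard deviation so that the end points match a plausible range) is one simple route to priors of this kind.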

The  posterior  means  and  standard  deviations  are  given  in  Table  9.8,  and  we  see 
broad  agreement  with  the  MLEs  and  standard  errors  found  using  Gauss-Hermite. 
The intra-class correlation coefficient is estimated as 0.60 with 95% credible interval
[0.55, 0.67].


9.13.4 Conditional Likelihood Inference for Binary Mixed Models


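Recall that conditional likelihood is a technique for eliminating nuisance parameters,
in this case the random effects in the mixed model. Following from Sect. 9.5, we
outline the approach as applied to the binary mixed model with random intercepts.
Consider individual $i$ with binary observations $y_{i1}, \ldots, y_{in_i}$ and assume the random
intercepts model $Y_{ij} \mid \lambda_i, \beta^* \sim \text{Bernoulli}(p_{ij})$, where

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \lambda_i + x_{ij}\beta^*$$

and $\lambda_i = x_i\beta^\dagger + b_i$, so that $\beta^\dagger$ represents those parameters associated with
covariates that are constant within an individual and $\beta^*$ those that vary. Mimicking
the development in Sect. 9.5, the joint distribution for the responses of the $i$th unit is

$$\Pr(y_{i1}, \ldots, y_{in_i} \mid \lambda_i, \beta^*)
= \frac{\exp\left(\lambda_i \sum_{j=1}^{n_i} y_{ij} + \beta^{*T} \sum_{j=1}^{n_i} x_{ij}^T y_{ij}\right)}{\prod_{j=1}^{n_i}\left[1 + \exp\left(\lambda_i + \beta^{*T} x_{ij}^T\right)\right]}
= \frac{\exp\left(\lambda_i t_{2i} + \beta^{*T} t_{1i}\right)}{\prod_{j=1}^{n_i}\left[1 + \exp\left(\lambda_i + \beta^{*T} x_{ij}^T\right)\right]}$$

$$= \frac{\exp\left(\lambda_i t_{2i} + \beta^{*T} t_{1i}\right)}{k(\lambda_i, \beta^*)}
= p(t_{1i}, t_{2i} \mid \lambda_i, \beta^*),$$

where

$$t_{1i} = \sum_{j=1}^{n_i} x_{ij}^T y_{ij}, \qquad t_{2i} = \sum_{j=1}^{n_i} y_{ij} = y_{i+},$$

and

$$k(\lambda_i, \beta^*) = \prod_{j=1}^{n_i}\left[1 + \exp\left(\lambda_i + \beta^{*T} x_{ij}^T\right)\right].$$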

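Therefore, the conditioning statistic is the number of successes on the $i$th unit.
We have conditional likelihood

$$L_c(\beta^*) = \prod_{i=1}^m p(t_{1i} \mid t_{2i}, \beta^*) = \prod_{i=1}^m \frac{p(t_{1i}, t_{2i} \mid \lambda_i, \beta^*)}{p(t_{2i} \mid \lambda_i, \beta^*)},$$

where

$$p(t_{2i} \mid \lambda_i, \beta^*) = \frac{\sum_{l=1}^{\binom{n_i}{y_{i+}}} \exp\left(\lambda_i y_{i+} + \beta^{*T} \sum_{k=1}^{n_i} x_{ik}^T y_{ik}^{(l)}\right)}{k(\lambda_i, \beta^*)}$$

and the summation is over the $\binom{n_i}{y_{i+}}$ ways of choosing $y_{i+}$ ones out of $n_i$, and
$y_i^{(l)} = [y_{i1}^{(l)}, \ldots, y_{in_i}^{(l)}]$, $l = 1, \ldots, \binom{n_i}{y_{i+}}$, is the collection of these ways. Inference
may be based on the conditional likelihood

$$L_c(\beta^*) = \prod_{i=1}^m \frac{\exp\left(\lambda_i y_{i+} + \beta^{*T} \sum_{j=1}^{n_i} x_{ij}^T y_{ij}\right)}{\sum_{l=1}^{\binom{n_i}{y_{i+}}} \exp\left(\lambda_i y_{i+} + \beta^{*T} \sum_{k=1}^{n_i} x_{ik}^T y_{ik}^{(l)}\right)}
= \prod_{i=1}^m \frac{\exp\left(\beta^{*T} \sum_{j=1}^{n_i} x_{ij}^T y_{ij}\right)}{\sum_{l=1}^{\binom{n_i}{y_{i+}}} \exp\left(\beta^{*T} \sum_{k=1}^{n_i} x_{ik}^T y_{ik}^{(l)}\right)}.$$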


Hence,  there  is  no  need  to  specify  a  distribution  for  the  unit-specific  parameters 
that  allow  for  within-unit  dependence,  as  they  are  eliminated  by  the  conditioning 
argument. 

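As an example, if $n_i = 3$ and $y_i = [0, 0, 1]$ so that $y_{i+} = 1$, then

$$y_i^{(1)} = [1, 0, 0], \quad y_i^{(2)} = [0, 1, 0], \quad y_i^{(3)} = [0, 0, 1]$$

and the contribution to the conditional likelihood is

$$\frac{\exp(\beta^{*T} x_{i3}^T)}{\exp(\beta^{*T} x_{i1}^T) + \exp(\beta^{*T} x_{i2}^T) + \exp(\beta^{*T} x_{i3}^T)}.$$

As a second example, if $n_i = 3$ and $y_i = [1, 0, 1]$ so that $y_{i+} = 2$, then

$$y_i^{(1)} = [1, 1, 0], \quad y_i^{(2)} = [1, 0, 1], \quad y_i^{(3)} = [0, 1, 1]$$

and the contribution to the conditional likelihood is

$$\frac{\exp(\beta^{*T} x_{i1}^T + \beta^{*T} x_{i3}^T)}{\exp(\beta^{*T} x_{i1}^T + \beta^{*T} x_{i2}^T) + \exp(\beta^{*T} x_{i1}^T + \beta^{*T} x_{i3}^T) + \exp(\beta^{*T} x_{i2}^T + \beta^{*T} x_{i3}^T)}.$$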


There is no contribution to the conditional likelihood from individuals with
$n_i = 1$ or $y_{i+} = 0$ or $y_{i+} = n_i$. The conditional likelihood can be computationally
expensive to evaluate if $n_i$ is large; for example, if $n_i = 20$ and $y_{i+} = 10$,
there are $\binom{20}{10} = 184{,}756$ variations. The similarity to Cox's partial likelihood
(e.g., Kalbfleisch and Prentice 2002, Chap. 4) may be exploited to carry out
computation, however.
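
The contribution of a single unit can be evaluated by brute-force enumeration, which also makes the combinatorial cost visible. The following is a minimal sketch, not an efficient implementation: the function name and the toy covariate values are hypothetical, and covariates are passed as plain lists.

```python
import math
from itertools import combinations

def conditional_contribution(x, y, beta):
    """Conditional likelihood contribution of one unit.

    x: list of covariate vectors (one per observation), y: list of 0/1
    responses, beta: coefficients of the within-unit covariates.
    The denominator sums over every way of placing sum(y) ones among
    the len(y) observations.
    """
    n, successes = len(y), sum(y)
    lin = [sum(b * v for b, v in zip(beta, xj)) for xj in x]
    numerator = math.exp(sum(l for l, yj in zip(lin, y) if yj == 1))
    denominator = sum(
        math.exp(sum(lin[j] for j in ones))
        for ones in combinations(range(n), successes)
    )
    return numerator / denominator

# Toy unit with three observations and a single time-like covariate.
x = [[0.0], [1.0], [2.0]]
print(conditional_contribution(x, [0, 0, 1], [0.5]))

# Number of terms in the denominator when n_i = 20 and y_{i+} = 10.
print(math.comb(20, 10))
```

Units with all zeros or all ones return exactly 1 under this function, which is the computational counterpart of the statement that they do not contribute.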

We reiterate that the conditional likelihood estimates those elements of $\beta^*$ that
are associated with covariates that vary within individuals. If a covariate only varies
between individuals, then its effect cannot be estimated using conditional likelihood.
For covariates that vary both between and within individuals, only the
within-individual contrasts are used.


9.14  Marginal  Models  for  Dependent  Binary  Data 

We  now  consider  the  marginal  modeling  of  dependent  binary  data.  We  begin  by 
describing  how  the  GEE  approach  of  Sect.  9.9  can  be  used  for  binary  data  and  then 
describe  alternative  approaches. 


9.14.1  Generalized  Estimating  Equations 

For the marginal Bernoulli outcome $Y_{ij} \mid \mu_{ij} \sim \text{Bernoulli}(\mu_{ij})$ and with a logistic
regression model, we have the exponential family representation

$$\Pr(Y_{ij} = y_{ij} \mid x_{ij}) = \mu_{ij}^{y_{ij}} (1 - \mu_{ij})^{1 - y_{ij}}
= \exp\left\{ y_{ij}\theta_{ij} - \log[1 + \exp(\theta_{ij})] \right\},$$

where

$$\theta_{ij} = \log\left(\frac{\mu_{ij}}{1 - \mu_{ij}}\right) = x_{ij}\gamma.$$

For independent responses, the likelihood is

$$\Pr(Y = y \mid x) = \exp\left\{ \sum_{i=1}^m \sum_{j=1}^{n_i} y_{ij}\theta_{ij} - \sum_{i=1}^m \sum_{j=1}^{n_i} \log[1 + \exp(\theta_{ij})] \right\}
= \exp\left( \sum_{i=1}^m \sum_{j=1}^{n_i} l_{ij} \right).$$

To find the MLEs, we consider the score equation

$$G(\gamma) = \frac{\partial l}{\partial \gamma} = \sum_{i=1}^m \sum_{j=1}^{n_i} \frac{\partial l_{ij}}{\partial \theta_{ij}} \frac{\partial \theta_{ij}}{\partial \gamma}
= \sum_{i=1}^m \sum_{j=1}^{n_i} x_{ij}^T (y_{ij} - \mu_{ij}) = \sum_{i=1}^m x_i^T (y_i - \mu_i)$$

with $\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]^T$. This form is identical to the use of GEE with working
independence and so can be implemented with standard software, though we
need to "fix up" the standard errors via sandwich estimation. Hence, the above
estimating equation construction offers a very simple approach to inference which
may be adequate if the dependence between observations on the same unit is
small. If the correlations are not small, then efficiency considerations suggest that
nonindependence working covariance models should be entertained.

As with other types of data (Sect. 9.9), we can model the correlation structure
(Liang and Zeger 1986) and assume $\text{var}(Y_i) = W_i$ with $W_i = \Delta_i^{1/2} R_i(\alpha) \Delta_i^{1/2}$,
with $\Delta_i$ a diagonal matrix with $j$th diagonal entry $\text{var}(Y_{ij}) = \mu_{ij}(1 - \mu_{ij})$ and
$R_i(\alpha)$ a working correlation model depending on parameters $\alpha$. In this case, the
estimating function is

$$G(\gamma, \alpha) = \sum_{i=1}^m D_i^T W_i^{-1} (y_i - \mu_i), \tag{9.24}$$

where $D_i = \partial \mu_i / \partial \gamma$. As usual, an estimate of $\alpha$ is required, with an obvious choice
being a method of moments estimator. The variance of the estimator takes the usual
sandwich form (9.12).
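
Under working independence, $W_i = \Delta_i$ and $D_i = \Delta_i x_i$, so the estimating function reduces to the logistic score and only the standard errors need adjusting. The sketch below illustrates this special case with NumPy. It is an assumption-laden illustration rather than a reference implementation: the function name is hypothetical, Newton's method is used to solve the estimating equation, and no small-sample correction is applied to the sandwich estimator.

```python
import numpy as np

def gee_independence(X, y, cluster, n_iter=25, tol=1e-10):
    """Logistic GEE with working independence and sandwich covariance.

    X: (N, p) design matrix, y: (N,) binary responses,
    cluster: (N,) unit labels. Returns (gamma_hat, sandwich_cov).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ gamma))
        # "Bread": sum_i D_i^T W_i^{-1} D_i = X^T diag(mu(1-mu)) X.
        bread = X.T @ (X * (mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(bread, X.T @ (y - mu))
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-X @ gamma))
    bread = X.T @ (X * (mu * (1.0 - mu))[:, None])
    # "Meat": sum over units of the outer product of unit-level scores.
    meat = np.zeros_like(bread)
    for c in np.unique(cluster):
        rows = cluster == c
        score_c = X[rows].T @ (y[rows] - mu[rows])
        meat += np.outer(score_c, score_c)
    bread_inv = np.linalg.inv(bread)
    return gamma, bread_inv @ meat @ bread_inv
```

The square roots of the diagonal of the returned matrix are the sandwich standard errors; they remain valid when responses within a unit are correlated, whereas the model-based inverse of `bread` alone does not.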


9.14.2  Loglinear  Models 

We now consider another approach to constructing models for dependent binary data
that may form the basis for likelihood or GEE procedures. Loglinear models are a
popular choice for cross-classified discrete data (Cox 1972; Bishop et al. 1975). We
begin by returning to the situation in which we have $n$ responses on a single unit,
$Y_j$, $j = 1, \ldots, n$. A saturated loglinear model is

$$\Pr(Y = y) = c(\theta) \exp\left( \sum_{j=1}^n \theta_j^{(1)} y_j + \sum_{j_1 < j_2} \theta_{j_1 j_2}^{(2)} y_{j_1} y_{j_2} + \cdots + \theta_{12 \ldots n}^{(n)} y_1 \cdots y_n \right)$$

with $2^n - 1$ parameters $\theta = [\theta_1^{(1)}, \ldots, \theta_n^{(1)}, \theta_{12}^{(2)}, \ldots, \theta_{n-1,n}^{(2)}, \ldots, \theta_{12 \ldots n}^{(n)}]^T$, and
normalizing constant $c(\theta)$. To provide an interpretation of the parameters, consider
the case of $n = 2$ trials for which

$$\Pr(Y_1 = y_1, Y_2 = y_2) = c(\theta) \exp\left( \theta_1^{(1)} y_1 + \theta_2^{(1)} y_2 + \theta_{12}^{(2)} y_1 y_2 \right),$$

where $\theta = [\theta_1^{(1)}, \theta_2^{(1)}, \theta_{12}^{(2)}]^T$ and

$$c(\theta)^{-1} = \sum_{y_1=0}^{1} \sum_{y_2=0}^{1} \exp\left( \theta_1^{(1)} y_1 + \theta_2^{(1)} y_2 + \theta_{12}^{(2)} y_1 y_2 \right).$$

Table 9.10 Probabilities of the four possible outcomes for two binary variables via a
loglinear representation

  y_1   y_2   Pr(Y_1 = y_1, Y_2 = y_2)
  0     0     $c(\theta)$
  1     0     $c(\theta) \exp(\theta_1^{(1)})$
  0     1     $c(\theta) \exp(\theta_2^{(1)})$
  1     1     $c(\theta) \exp(\theta_1^{(1)} + \theta_2^{(1)} + \theta_{12}^{(2)})$


Table 9.10 gives the forms of the probabilities for the loglinear representation, from
which we can determine the interpretation of the three parameters:

$$\exp(\theta_1^{(1)}) = \frac{\Pr(Y_1 = 1 \mid y_2 = 0)}{\Pr(Y_1 = 0 \mid y_2 = 0)}$$

is the odds of an event at trial 1, given no event at trial 2,

$$\exp(\theta_2^{(1)}) = \frac{\Pr(Y_2 = 1 \mid y_1 = 0)}{\Pr(Y_2 = 0 \mid y_1 = 0)}$$

is the odds of an event at trial 2, given no event at trial 1, and

$$\exp(\theta_{12}^{(2)}) = \frac{\Pr(Y_2 = 1 \mid y_1 = 1) / \Pr(Y_2 = 0 \mid y_1 = 1)}{\Pr(Y_2 = 1 \mid y_1 = 0) / \Pr(Y_2 = 0 \mid y_1 = 0)}$$

is the ratio of the odds of an event at trial 2 given an event at trial 1, divided by the
odds of an event at trial 2 given no event at trial 1. Consequently, if this parameter
is larger than 1, there is positive dependence between $Y_1$ and $Y_2$.

For general $n$, a simplified version of the loglinear model is provided when third-
and higher-order terms are set to zero, so that

$$\Pr(Y = y) = c(\theta) \exp\left( \sum_{j=1}^n \theta_j^{(1)} y_j + \sum_{j<k} \theta_{jk}^{(2)} y_j y_k \right). \tag{9.25}$$

For this model,

$$\frac{\Pr(Y_j = 1 \mid Y_k = y_k, Y_l = 0, l \neq j, k)}{\Pr(Y_j = 0 \mid Y_k = y_k, Y_l = 0, l \neq j, k)} = \exp\left(\theta_j^{(1)} + \theta_{jk}^{(2)} y_k\right),$$

so that $\exp(\theta_j^{(1)})$ is the (conditional) odds of an event at trial $j$, given all other
responses are zero. Further, $\exp(\theta_{jk}^{(2)})$ is the odds ratio describing the association
between $Y_j$ and $Y_k$, given all other responses are set equal to zero, that is,

$$\frac{\Pr(Y_j = 1, Y_k = 1 \mid Y_l = 0, l \neq j, k) \Pr(Y_j = 0, Y_k = 0 \mid Y_l = 0, l \neq j, k)}{\Pr(Y_j = 1, Y_k = 0 \mid Y_l = 0, l \neq j, k) \Pr(Y_j = 0, Y_k = 1 \mid Y_l = 0, l \neq j, k)} = \exp(\theta_{jk}^{(2)}).$$

The  quadratic  model  (9.25)  was  described  in  Sect.  9.10  and  was  suggested  for  the 
analysis  of  binary  data  by  Zhao  and  Prentice  (1990).  Recall  that  this  model  has  the 
appealing  property  of  consistency  so  long  as  the  first  two  moments  are  correctly 
specified.  The  quadratic  exponential  model  is  unique  in  this  respect. 

Unfortunately, parameterizing in terms of the $\theta$ parameters is unappealing for
regression modeling where the primary aim is to model the response as a function
of $x$. To illustrate, consider binary longitudinal data with a binary covariate $x$ and
suppose we let the parameters $\theta$ depend on $x$. The difference between the log odds
$\theta_j^{(1)}(x = 1)$ and $\theta_j^{(1)}(x = 0)$ represents the effect of $x$ on the conditional log odds
of an event at period $j$, given that there were no events at any other trials, which is
difficult to interpret. We would rather model the marginal means $\mu$, and these are a
function of both $\theta^{(1)}$ and $\theta^{(2)}$. For example, for the $n = 2$ case presented in Table
9.10, the marginal means are

$$E[Y_1] = c(\theta) \exp(\theta_1^{(1)}) \left[1 + \exp(\theta_2^{(1)} + \theta_{12}^{(2)})\right]$$
$$E[Y_2] = c(\theta) \exp(\theta_2^{(1)}) \left[1 + \exp(\theta_1^{(1)} + \theta_{12}^{(2)})\right],$$

and these forms do not lend themselves to straightforward incorporation of covariates.
Hence, alternative approaches have been proposed as we now discuss.
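
For the $n = 2$ case, the cell probabilities, the odds ratio interpretation of the second-order parameter, and the marginal means can all be verified numerically. The sketch below uses hypothetical function names and arbitrary parameter values chosen only for illustration.

```python
import math

def loglinear_probs(theta1, theta2, theta12):
    """Cell probabilities for two binary outcomes under the loglinear model.

    Returns a dict keyed by (y1, y2) and the normalizing constant c(theta).
    """
    weights = {
        (y1, y2): math.exp(theta1 * y1 + theta2 * y2 + theta12 * y1 * y2)
        for y1 in (0, 1)
        for y2 in (0, 1)
    }
    c = 1.0 / sum(weights.values())
    return {cell: c * w for cell, w in weights.items()}, c

def marginal_means(theta1, theta2, theta12):
    """E[Y1] and E[Y2], obtained by summing the relevant cells."""
    p, _ = loglinear_probs(theta1, theta2, theta12)
    return p[(1, 0)] + p[(1, 1)], p[(0, 1)] + p[(1, 1)]

# Arbitrary illustrative values.
t1, t2, t12 = -1.0, 0.5, 1.2
p, c = loglinear_probs(t1, t2, t12)
print(p, marginal_means(t1, t2, t12))
```

Changing `theta12` while holding the other two parameters fixed alters both marginal means, which is the practical obstacle to placing covariates directly on this scale.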
9.14.3 Further Multivariate Binary Models

A  number  of  approaches  are  based  on  assuming  a  marginal  mean  model,  to 
overcome  the  problems  described  in  the  previous  section,  along  with  a  second  set  of 
parameters  to  model  the  dependence. 

First, we may reparameterize the model via the mean vector $\mu$ and second-
and higher-order loglinear parameters. For example, we may consider second-order
parameters only and work with $\mu$ and the loglinear parameters $\theta^{(2)}$, as suggested by
Fitzmaurice and Laird (1993). The latter used maximum likelihood for estimation.
There are two disadvantages to this approach. First, the interpretation of the $\theta^{(2)}$
parameters depends on the number of responses $n$. This is particularly a problem
in a longitudinal setting with differing $n_i$. Hence, this approach is most useful for
data that have $n_i = n$ for all $i$. Second, if interest lies in understanding the structure
of the dependence, the conditional odds ratio parameters do not have the attractive
simple interpretation of marginal odds ratios.

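A second approach is based on modeling the correlations in addition to the
means. Let

$$e_{ij}^* = \frac{Y_{ij} - \mu_{ij}}{[\mu_{ij}(1 - \mu_{ij})]^{1/2}},$$

$$\rho_{ijk} = \text{corr}(Y_{ij}, Y_{ik}) = E[e_{ij}^* e_{ik}^*],$$

$$\rho_{ijkl} = E[e_{ij}^* e_{ik}^* e_{il}^*],$$

$$\vdots$$

$$\rho_{i1 \ldots n_i} = E[e_{i1}^* e_{i2}^* \cdots e_{in_i}^*].$$

The correlations have marginal interpretations. For example, $\rho_{ijkl}$ is a three-way
association parameter. Bahadur (1961) defined a multivariate binary model based on
the marginal means and these correlations. The probability for the set of outcomes
on unit $i$ is

$$\Pr(Y_i = y_i) = \prod_{j=1}^{n_i} \mu_{ij}^{y_{ij}} (1 - \mu_{ij})^{1 - y_{ij}} \times \left( 1 + \sum_{j<k} \rho_{ijk} e_{ij}^* e_{ik}^* + \sum_{j<k<l} \rho_{ijkl} e_{ij}^* e_{ik}^* e_{il}^* + \cdots + \rho_{i1 \ldots n_i} e_{i1}^* e_{i2}^* \cdots e_{in_i}^* \right).$$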

Unfortunately, the correlations are constrained in complicated ways by the marginal
means. As an example, consider two measurements on a single individual, $Y_{i1}$ and
$Y_{i2}$, with means $\mu_{i1}$ and $\mu_{i2}$. The correlation is

$$\text{corr}(Y_{i1}, Y_{i2}) = \frac{\Pr(Y_{i1} = 1, Y_{i2} = 1) - \mu_{i1}\mu_{i2}}{[\mu_{i1}(1 - \mu_{i1})\mu_{i2}(1 - \mu_{i2})]^{1/2}}$$

and

$$\max(0, \mu_{i1} + \mu_{i2} - 1) \le \Pr(Y_{i1} = 1, Y_{i2} = 1) \le \min(\mu_{i1}, \mu_{i2}),$$

Table 9.11 Notation in the case of $n_i = 2$ binary responses on individual $i$

                 Y_i2 = 0                                   Y_i2 = 1                    Total
  Y_i1 = 0       $1 - \mu_{i1} - \mu_{i2} + \mu_{i12}$      $\mu_{i2} - \mu_{i12}$      $1 - \mu_{i1}$
  Y_i1 = 1       $\mu_{i1} - \mu_{i12}$                     $\mu_{i12}$                 $\mu_{i1}$
  Total          $1 - \mu_{i2}$                             $\mu_{i2}$

which implies complicated constraints on the correlation. For example, if $\mu_{i1} = 0.8$
and $\mu_{i2} = 0.2$, then $0 \le \Pr(Y_{i1} = 1, Y_{i2} = 1) \le 0.2$, so that
$\text{corr}(Y_{i1}, Y_{i2}) \le 0.25$. The message here is that
correlations are not a natural measure of dependence for binary data so that the
Bahadur representation is not appealing.
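
The restriction that the margins place on the joint probability, and hence on the largest attainable correlation, is simple to compute. The following sketch uses hypothetical function names and evaluates the bounds for the means used in the example.

```python
import math

def joint_probability_bounds(mu1, mu2):
    """Range of Pr(Y1 = 1, Y2 = 1) compatible with the two marginal means."""
    return max(0.0, mu1 + mu2 - 1.0), min(mu1, mu2)

def max_correlation(mu1, mu2):
    """Largest correlation attainable for binary outcomes with these means."""
    _, upper = joint_probability_bounds(mu1, mu2)
    return (upper - mu1 * mu2) / math.sqrt(mu1 * (1 - mu1) * mu2 * (1 - mu2))

print(joint_probability_bounds(0.8, 0.2))
print(max_correlation(0.8, 0.2))
print(max_correlation(0.5, 0.5))
```

Only when the two means coincide can the correlation reach 1, so a working correlation that ignores the means can describe dependence that no binary distribution is able to produce.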

A third approach (Lipsitz et al. 1991; Liang et al. 1992) is to parameterize in
terms of the marginal means and the marginal odds ratios. Let

$$\delta_{ijk} = \frac{\Pr(Y_{ij} = 1, Y_{ik} = 1) \Pr(Y_{ij} = 0, Y_{ik} = 0)}{\Pr(Y_{ij} = 1, Y_{ik} = 0) \Pr(Y_{ij} = 0, Y_{ik} = 1)}
= \frac{\Pr(Y_{ij} = 1 \mid Y_{ik} = 1) / \Pr(Y_{ij} = 0 \mid Y_{ik} = 1)}{\Pr(Y_{ij} = 1 \mid Y_{ik} = 0) / \Pr(Y_{ij} = 0 \mid Y_{ik} = 0)},$$

which is the odds (for individual $i$) that the $j$th observation is a 1, given the $k$th
observation is a 1, divided by the odds that the $j$th observation is a 1, given the $k$th
observation is a 0. Therefore, we have a set of marginal odds ratios, and if $\delta_{ijk} > 1$,
we have positive dependence between outcomes $j$ and $k$. It is then possible to obtain
the joint distribution in terms of the means $\mu$, where $\mu_{ij} = \Pr(Y_{ij} = 1)$, the odds
ratios $\delta_i = [\delta_{i12}, \ldots, \delta_{i,n_i-1,n_i}]$, and contrasts of odds ratios. To determine the
probability distribution of the data, we need to find

$$\mu_{ijk} = E[Y_{ij} Y_{ik}] = \Pr(Y_{ij} = 1, Y_{ik} = 1),$$

so that we can write down either the likelihood function or an estimating function.
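For the case of $n_i = 2$ (see Table 9.11), we have

$$\delta_{i12} = \frac{\Pr(Y_{i1} = 1, Y_{i2} = 1) \Pr(Y_{i1} = 0, Y_{i2} = 0)}{\Pr(Y_{i1} = 1, Y_{i2} = 0) \Pr(Y_{i1} = 0, Y_{i2} = 1)}
= \frac{\mu_{i12}(1 - \mu_{i1} - \mu_{i2} + \mu_{i12})}{(\mu_{i1} - \mu_{i12})(\mu_{i2} - \mu_{i12})}$$

and so

$$\mu_{i12}^2 (\delta_{i12} - 1) + \mu_{i12} b_i + \delta_{i12} \mu_{i1} \mu_{i2} = 0,$$

where $b_i = (\mu_{i1} + \mu_{i2})(1 - \delta_{i12}) - 1$, to give

$$\mu_{i12} = \frac{-b_i \pm \sqrt{b_i^2 - 4(\delta_{i12} - 1)\delta_{i12}\mu_{i1}\mu_{i2}}}{2(\delta_{i12} - 1)}$$

if $\delta_{i12} \neq 1$ and $\mu_{i12} = \mu_{i1}\mu_{i2}$ if $\delta_{i12} = 1$. The likelihood is

$$\prod_{i=1}^m \left[ \mu_{i1}^{y_{i1}} (1 - \mu_{i1})^{1 - y_{i1}} \mu_{i2}^{y_{i2}} (1 - \mu_{i2})^{1 - y_{i2}} + (-1)^{(y_{i1} - y_{i2})} (\mu_{i12} - \mu_{i1}\mu_{i2}) \right] \tag{9.26}$$

(Exercise 9.3).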
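
The mapping from the two means and the marginal odds ratio to the joint probability, and from there to the four cell probabilities entering (9.26), can be sketched as follows. The function names are hypothetical, and the code takes the root with the negative square root, an assumption that gave a valid probability in the cases tried here (odds ratios both above and below 1).

```python
import math

def joint_from_odds_ratio(mu1, mu2, delta):
    """Solve the quadratic for mu12 = Pr(Y1 = 1, Y2 = 1)."""
    if abs(delta - 1.0) < 1e-12:
        return mu1 * mu2  # independence
    a = delta - 1.0
    b = (mu1 + mu2) * (1.0 - delta) - 1.0
    disc = b * b - 4.0 * a * delta * mu1 * mu2
    return (-b - math.sqrt(disc)) / (2.0 * a)

def cell_probability(y1, y2, mu1, mu2, mu12):
    """Probability of (y1, y2) in the form used in the likelihood (9.26)."""
    indep = mu1**y1 * (1 - mu1) ** (1 - y1) * mu2**y2 * (1 - mu2) ** (1 - y2)
    return indep + (-1) ** (y1 - y2) * (mu12 - mu1 * mu2)

mu1, mu2, delta = 0.5, 0.5, 4.0
mu12 = joint_from_odds_ratio(mu1, mu2, delta)
cells = {(a, b): cell_probability(a, b, mu1, mu2, mu12)
         for a in (0, 1) for b in (0, 1)}
print(mu12, cells)
```

Recomputing the odds ratio from the four cells is a convenient check that the intended root has been selected.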

As the number of binary responses increases so does the complexity of solving
for the $\mu_{ijk}$'s; see Liang et al. (1992) for further details. In the case of large $n_i$, there
is a large number of nuisance odds ratios, and assumptions such as $\delta_{ijk} = \delta$ for
$i = 1, \ldots, m$, $j, k = 1, \ldots, n_i$, may be made.

In a longitudinal setting, another possibility is to take

$$\log \delta_{ijk} = \alpha_0 + \alpha_1 |t_{ij} - t_{ik}|^{-1},$$

so that the degree of association is inversely proportional to the time between
observations. Computation may be carried out by setting up an estimating equation
for $\gamma$ and a method of moments estimator for estimation of the covariance
parameters. As an alternative, GEE2 may be used with a pair of linked estimating
equations (Sect. 9.10).

Letting $\alpha_{ijk} = \log \delta_{ijk}$, Carey et al. (1993) suggest the following approach for
estimating $\beta$ and $\alpha$. It is easy to show that

$$\frac{\Pr(Y_{ij} = 1 \mid Y_{ik} = y_{ik})}{\Pr(Y_{ij} = 0 \mid Y_{ik} = y_{ik})}
= \exp(y_{ik}\alpha_{ijk}) \frac{\Pr(Y_{ij} = 1, Y_{ik} = 0)}{\Pr(Y_{ij} = 0, Y_{ik} = 0)}
= \exp(y_{ik}\alpha_{ijk}) \left( \frac{\mu_{ij} - \mu_{ijk}}{1 - \mu_{ij} - \mu_{ik} + \mu_{ijk}} \right),$$

which can be written as a logistic regression model for the conditional probabilities
$E[Y_{ij} \mid Y_{ik}]$:

$$\text{logit}(E[Y_{ij} \mid Y_{ik}]) = \log\left( \frac{\Pr(Y_{ij} = 1 \mid Y_{ik} = y_{ik})}{\Pr(Y_{ij} = 0 \mid Y_{ik} = y_{ik})} \right)
= y_{ik}\alpha_{ijk} + \log\left( \frac{\mu_{ij} - \mu_{ijk}}{1 - \mu_{ij} - \mu_{ik} + \mu_{ijk}} \right),$$

where the term on the right is an offset (given estimates of the means). Suppose, for
simplicity, that $\alpha_{ijk} = \alpha$. Then, given current estimates of $\beta$, $\alpha$, we can fit a logistic
regression model by regressing $Y_{ij}$ on $Y_{ik}$ for $1 \le j < k \le n_i$, to reestimate $\alpha$. The
offset is a function of $\alpha$ and $\beta$ so iteration is required. Consequently, Carey et al.
(1993) named this approach alternating logistic regressions. Once the $\alpha$ parameters
are estimated, one may solve for $\text{var}(Y_i)$ in order to use the estimating function
(9.24).

In some situations, interest may focus on estimating/modeling the within-unit
dependence. Basing a model on correlation parameters is not appealing, but using
marginal log odds ratios suggests the model $\alpha_{ijk} = x_{ijk}^* \psi$ for a set of covariates of
interest $x_{ijk}^*$ with associated regression coefficients $\psi$.
Table 9.12 GEE parameter estimates for the contraception data

                     GEE independence     GEE exchangeable     GEE ALR(a)
  Parameter          Est.     Std. err.   Est.     Std. err.   Est.     Std. err.
  Intercept          -2.2     0.18        -2.2     0.18        -2.3     0.16
  Low-dose time      0.67     0.16        0.70     0.16        0.70     0.15
  Low-dose time^2    -0.030   0.033       -0.033   0.032       -0.033   0.031
  High-dose time     0.30     0.11        0.33     0.11        0.34     0.11
  High-dose time^2   -0.062   0.030       -0.064   0.029       -0.067   0.028

(a) Alternating logistic regression


Example:  Contraception  Data 

Table 9.12 gives parameter estimates and standard errors for various implementations
of GEE, for the marginal model

$$\log\left( \frac{\mu_{ij}}{1 - \mu_{ij}} \right) = \gamma_0 + \gamma_1 t_{ij} + \gamma_2 t_{ij}^2 + \gamma_3 d_i t_{ij} + \gamma_4 d_i t_{ij}^2, \tag{9.27}$$

where the $\gamma$ notation emphasizes that we are estimating marginal parameters. We
initially implement GEE with working independence; in general, this is not to be
recommended unless it is thought that the outcomes within a cluster are close
to independent. We also allow a working exchangeable structure, with the latter
parameterized in terms of correlations. Finally, we assume a working exchangeable
model parameterized in terms of a common (marginal) log odds ratio. For these
data, there are few substantive differences between the approaches. Under the
exchangeable models, the common correlation is estimated as 0.36 (0.024) (which
is in line with the correlations in Table 9.1), while the common log odds ratio is
estimated as 2.0 (0.11). The latter is the log of the ratio of the odds of amenorrhea
at time $t$, given amenorrhea at time $s$, to the odds of amenorrhea at time $t$, given no
amenorrhea at time $s$, $s \neq t$.

We may compare these results with a random intercept GLMM. The Bayesian
marginal estimates obtained by dividing the posterior means and the posterior
standard deviations by $(c^2\sigma_0^2 + 1)^{1/2}$ result in the estimates (standard errors): -2.3
(0.17), 0.68 (0.15), -0.019 (0.032), 0.34 (0.11), and -0.066 (0.035), which are in
close agreement with the point and interval estimates in Table 9.12. The marginal
probabilities from the GEE exchangeable model were identical to those obtained via
Monte Carlo integration in the mixed model (and displayed on Fig. 9.7).
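
The conversion from conditional to approximate marginal coefficients is a single division. The sketch below assumes that $c$ is the constant $16\sqrt{3}/(15\pi)$ of the usual logistic-normal approximation, which is not restated in this section; the function name and the input values are hypothetical and are not the estimates of Table 9.8.

```python
import math

# Assumed constant of the logistic-normal approximation.
C = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)

def to_marginal(conditional_value, sigma0):
    """Attenuate a conditional (unit-level) coefficient or standard deviation."""
    return conditional_value / math.sqrt(C**2 * sigma0**2 + 1.0)

# Hypothetical conditional slope of 1.0 under increasing random-intercept spread.
for sigma0 in (0.0, 1.0, 2.0):
    print(sigma0, round(to_marginal(1.0, sigma0), 3))
```

The attenuation grows with $\sigma_0$: the larger the between-unit heterogeneity, the more the population-averaged effect is pulled toward zero relative to the unit-level effect.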

As  we  have  already  mentioned,  model  checking  is  very  difficult  with  binary  data. 
For  data  with  replication  across  common  x  variables,  one  may  obtain  empirical 
probabilities  and/or  logits  (as  in  Fig.  9.1),  which  may  suggest  model  forms  in  an 
exploratory  model  building  exercise  or  may  be  compared  with  fitted  summaries. 
Similarly,  the  dependence  structure  may  be  examined  across  covariate  groups,  via 
empirical  correlations  or  odds. 
Fig. 9.8 Logit of probability of amenorrhea over time in high- and low-dose groups with marginal
fits from exchangeable GEE model

Figure  9.8  shows  the  fitted  logistic  curves  in  each  dose  group  versus  time  along 
with  the  logits  of  the  probabilities  of  amenorrhea.  The  vertical  lines  represent  95% 
confidence  intervals  for  the  logits.  These  intervals  increase  slightly  in  width  over 
time  as  dropout  occurs.  Here,  we  would  conclude  that  the  model  fit  is  reasonable. 


9.15  Nonlinear  Mixed  Models 

We now turn attention to the nonlinear mixed model (NLMM). Our development
will be much shorter for this class of models. One reason for this is that the
nonlinearity results in very little analytical theory being available. Also, traditionally,
dependent nonlinear data have been analyzed with mixed models and not GEE
because the emphasis is often on unit-level inference. The fitting, inferential
summarization and assessment of assumptions will be illustrated using the theophylline
data described in Sect. 9.2.3.

In a nonlinear mixed model (NLMM), the first stage of a linear mixed model is
replaced by a nonlinear form. We describe a specific two-stage form that is useful in
many longitudinal situations. The response at time $t_{ij}$ is $y_{ij}$, and $x_{ij}$ are covariates
measured at these times, $i = 1, \ldots, m$, $j = 1, \ldots, n_i$. Let $N = \sum_{i=1}^m n_i$.

Stage One: Conditional on random effects, $b_i$, the response model is

$$y_{ij} = f(\eta_{ij}, t_{ij}) + \epsilon_{ij}, \tag{9.28}$$

where $f(\cdot, \cdot)$ is a nonlinear function and

$$\eta_{ij} = x_{ij}\beta + z_{ij} b_i,$$

with a $(k + 1) \times 1$ vector of fixed effects $\beta$, a $(q + 1) \times 1$ vector of random
effects, $b_i$, with $q \le k$, $x_i = [x_{i1}, \ldots, x_{in_i}]^T$ the design matrix for the fixed effect
with $x_{ij} = [1, x_{ij1}, \ldots, x_{ijk}]^T$ and $z_i = [z_{i1}, \ldots, z_{in_i}]^T$ the design matrix for the
random effects with $z_{ij} = [1, z_{ij1}, \ldots, z_{ijq}]^T$.

Stage Two: Random terms:

$$E[\epsilon_i] = 0, \quad \text{var}(\epsilon_i) = E_i(\alpha),$$
$$E[b_i] = 0, \quad \text{var}(b_i) = D(\alpha),$$
$$\text{cov}(b_i, \epsilon_i) = 0,$$

where $\alpha$ is the vector of variance-covariance parameters. A common model
assumes

$$\epsilon_i \sim_{ind} N(0, \sigma_\epsilon^2 I_{n_i}),$$
$$b_i \sim_{iid} N(0, D).$$

For this model, $\alpha = [\sigma_\epsilon^2, D]$.

For nonlinear models even the first two moments are not available in closed form.
In general:

$$E[Y_{ij}] = E_{b_i}[f(x_{ij}\beta + z_{ij} b_i, t_{ij})] \neq f(x_{ij}\beta, t_{ij}),$$

where $f(x_{ij}\beta, t_{ij})$ is the nonlinear curve evaluated at $b_i = 0$. Hence, unlike the
LMM, the nonlinear curve at a time point averaged across individuals is not equal
to the nonlinear curve at that time for an average individual (i.e., one with $b_i = 0$).
The variance is

$$\text{var}(Y_{ij}) = \sigma_\epsilon^2 + \text{var}_{b_i}[f(x_{ij}\beta + z_{ij} b_i, t_{ij})],$$

so that the marginal variance of the response is not constant across time, even when
we have a random intercepts only model (unlike the LMM). For responses on the
same individual, dependence is induced through the common random effects:

$$\text{cov}(Y_{ij}, Y_{ij'}) = \text{cov}_{b_i}[f(x_{ij}\beta + z_{ij} b_i, t_{ij}), f(x_{ij'}\beta + z_{ij'} b_i, t_{ij'})],$$

but, as with the GLMM, there is no closed form for the covariance. Finally, for
observations on different individuals:

$$\text{cov}(Y_{ij}, Y_{i'j'}) = 0$$

for $i \neq i'$. The data do not have a closed-form marginal distribution. These forms
illustrate that picking particular random effect structures cannot be based on specific
requirements in terms of the marginal variance and covariance. Rather, this choice
should be based on the context and on data availability.
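
Because the marginal moments are expectations over the random effects, they can be approximated by simulation. The sketch below uses a deliberately simple, hypothetical mean function with a random intercept on the log amplitude, chosen because its marginal mean happens to be available exactly and so the Monte Carlo answer can be checked; the function names and parameter values are illustrative assumptions only.

```python
import math
import random

def curve(eta, t, k=0.3):
    """Hypothetical mean function: amplitude exp(eta), decay at fixed rate k."""
    return math.exp(eta) * math.exp(-k * t)

def marginal_moments(beta0, sigma_b, sigma_eps, t, n_draws=100_000, seed=1):
    """Monte Carlo estimates of E[Y] and var(Y) at time t.

    The random intercept b ~ N(0, sigma_b^2) is drawn n_draws times and the
    curve is averaged over the draws.
    """
    rng = random.Random(seed)
    values = [curve(beta0 + rng.gauss(0.0, sigma_b), t) for _ in range(n_draws)]
    mean = sum(values) / n_draws
    var_curve = sum((v - mean) ** 2 for v in values) / (n_draws - 1)
    return mean, sigma_eps**2 + var_curve

for t in (0.0, 2.0, 4.0):
    mean, var = marginal_moments(1.0, 0.5, 0.1, t)
    print(t, round(mean, 3), round(curve(1.0, t), 3), round(var, 3))
```

The averaged curve sits above the curve for a unit with $b_i = 0$ at every time, and the marginal variance changes with time even though only the intercept is random.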

In a NLMM, the interpretation of parameters is usually tied to the particular
model. In a GLMM, one can make use of linearity on the linear predictor scale to
have an interpretation in terms of unit changes in covariates (as we have illustrated
for loglinear and logistic linear models). In a NLMM, this will not be possible,
however (since the model is nonlinear!).

We next briefly consider parameterization of the model, before considering
likelihood and Bayesian inference in Sects. 9.17 and 9.18, respectively. A GEE
approach is briefly considered in Sect. 9.19, but as previously mentioned, this is
not as popular as likelihood and Bayes approaches, and so this section is short.
The nonlinearity of the model means there is no sufficient statistic for $\beta$, and so
conditional likelihood cannot be used.


9.16  Parameterization  of  the  Nonlinear  Model 

In  contrast  to  LMMs  and  GLMMs,  there  is  no  obvious  way  to  parameterize  a 
NLMM,  and  the  way  one  proceeds  is  an  art  form.  Given  the  normal  random  effects 
distribution,  one  usually  parameterizes  to  quantities  on  the  whole  real  line.  This 
issue  relates  to  the  discussion  of  the  solution  locus  and  the  parameterization  of 
nonlinear  models  given  in  Sect.  6.15. 


Example:  A  Simple  Pharmacokinetic  Model 

The simplest pharmacokinetic model is

$$E[Y \mid V, k_e] = \frac{D}{V} \exp(-k_e t),$$

where $D$ is the known dose, $V > 0$ is the volume of distribution, and $k_e > 0$ is the
elimination rate constant. The obvious parameterization is $\beta_0 = \log V$, $\beta_1 = \log k_e$.
A key parameter of interest is the clearance, defined as $Cl = V \times k_e$, and so one
may alternatively take $\beta_1^* = \log Cl$ with $\beta_0^* = \beta_0$ as before. This parameterization
has a number of advantages. A first advantage is that the clearance for individual $i$
is often modeled as a function of covariates, for example, via a loglinear model of
the form

$$\log Cl_i = \alpha_0 + \alpha_1 x_i, \tag{9.29}$$

where $x_i$ is a covariate of interest such as weight. A second advantage is that the
clearance is a very stable parameter to estimate. The clearance is the dose $D$ divided
by the area under the concentration-time curve, and this area tends to be very well
estimated (unless there are few sample points at large times) and hence so does the
clearance, $Cl$.

If a Bayesian approach is adopted, then the prior must clearly be specific to the
parameterization. For example, for $\beta = [\beta_0, \beta_1]^T$ and $\beta^* = [\beta_0^*, \beta_1^*]^T$ the prior
$\beta \sim N_2(\mu_0, \Sigma_0)$ with fixed $\mu_0$, $\Sigma_0$, will clearly give different inference to assuming
$\beta^* \sim N_2(\mu_0, \Sigma_0)$. □
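
The relationship between clearance and the area under the concentration-time curve can be confirmed numerically for this model. The sketch below integrates the curve with the trapezoid rule; the function names and the dose, volume, and rate values are hypothetical.

```python
import math

def concentration(t, dose, volume, ke):
    """Mean concentration (D / V) exp(-ke t) of the simple model."""
    return dose / volume * math.exp(-ke * t)

def auc_trapezoid(dose, volume, ke, n_steps=20_000):
    """Area under the curve on [0, 40 / ke], by which time it is negligible."""
    t_max = 40.0 / ke
    h = t_max / n_steps
    total = 0.5 * (concentration(0.0, dose, volume, ke)
                   + concentration(t_max, dose, volume, ke))
    for i in range(1, n_steps):
        total += concentration(i * h, dose, volume, ke)
    return total * h

dose, volume, ke = 300.0, 30.0, 0.08   # hypothetical values
clearance = volume * ke
print(clearance, dose / auc_trapezoid(dose, volume, ke))
```

Since the area depends on $V$ and $k_e$ only through their product, data that pin down the area pin down $Cl$ even when $V$ and $k_e$ are individually poorly determined.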

There  is  some  theoretical  work  on  choosing  parameterizations  (Bates  and 
Watts  1980),  but  good  parameterizations  are  often  found  through  experience  with 
particular  models.  The  accuracy  of  asymptotic  approximations  is  also  crucially 
dependent  on  the  choice  of  parameterization,  with  stable  parameters  likely  to  display 
good  asymptotic  properties.  The  examination  of  likelihood  contours  (as  was  done 
in  Sect.  6.12)  can  indicate  whether  asymptotic  distributions  are  likely  to  be  accurate 
or  not. 

With many nonlinear models, care must be taken to ensure the model is
identifiable in the sense that if $\theta \neq \theta'$, $f(\theta) \neq f(\theta')$. If there is non-identifiability,
then one may either reparameterize the model or enforce identifiability through the
prior. The latter can be messy, however.

Unfortunately,  preserving  identifiability  and  retaining  an  interpretable  parameter 
cannot  usually  be  simultaneously  achieved.  We  illustrate  the  problems  with  an 
example. 


Example:  Pharmacokinetics  of  Theophylline 

As discussed in Sect. 6.2, the one-compartment open model is non-identifiable. We
illustrate by parameterizing as $[k_e, k_a, Cl]$ to give the mean model, for a generic
individual, as

$$E[Y] = \frac{D k_e k_a}{Cl (k_a - k_e)} \left[ \exp(-k_e t) - \exp(-k_a t) \right]. \tag{9.30}$$

This form is known as the "flip-flop" model because the parameters $[k_e, k_a, Cl]$
give the same curve as the parameters $[k_a, k_e, Cl]$. To enforce identifiability, it is
typical to assume that $k_a > k_e > 0$, since for many drugs, absorption is faster than
elimination. This suggests the parameterization $[\log k_e, \log(k_a - k_e), \log Cl]$.
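
Both the flip-flop symmetry and the effect of the constrained parameterization are easy to demonstrate. In the sketch below the function names and numerical values are hypothetical; the second function maps an unconstrained triple back to rate constants that automatically satisfy $k_a > k_e > 0$.

```python
import math

def one_compartment(t, dose, ke, ka, cl):
    """Mean concentration for first-order absorption and elimination."""
    return dose * ke * ka / (cl * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def from_unconstrained(theta):
    """Map [log ke, log(ka - ke), log Cl] to (ke, ka, Cl)."""
    ke = math.exp(theta[0])
    ka = ke + math.exp(theta[1])
    return ke, ka, math.exp(theta[2])

dose, ke, ka, cl = 4.0, 0.1, 1.5, 0.04   # hypothetical values
for t in (0.5, 2.0, 8.0):
    # Swapping the two rate constants leaves the curve unchanged.
    print(t, one_compartment(t, dose, ke, ka, cl), one_compartment(t, dose, ka, ke, cl))

print(from_unconstrained([math.log(0.1), math.log(1.4), math.log(0.04)]))
```

Any real-valued triple maps to a valid, ordered pair of rate constants, so normal random effects can be placed on the unconstrained scale without the two labelings being confused.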
As with the linear mixed and generalized linear mixed models already considered,
the likelihood is defined with respect to fixed effects $\beta$ and variance components $\alpha$:

$$p(y \mid \beta, \alpha) = \prod_{i=1}^m \int_{b_i} p(y_i \mid b_i, \beta, \sigma_\epsilon^2) \times p(b_i \mid D) \, db_i, \tag{9.31}$$

with $\alpha = [D, \sigma_\epsilon^2]$.

The first difficulty to overcome is how to calculate the required integrals, which
for nonlinear models are analytically intractable (recall for the LMM they were
available in closed form). As with the GLMM, two obvious approaches are to resort
to Laplace approximations or adaptive Gauss-Hermite. Pinheiro and Bates (2000,
Chap. 7) contains extensive details on these approaches (see also Bates 2011). We
wish to evaluate

$$p(y_i \mid \beta, \alpha) = (2\pi\sigma_\epsilon^2)^{-n_i/2} (2\pi)^{-(q+1)/2} |D|^{-1/2} \int \exp[n_i g(b_i)] \, db_i,$$

where

$$-2 n_i g(b_i) = [y_i - f_i(\beta, b_i, x_i)]^T [y_i - f_i(\beta, b_i, x_i)] / \sigma_\epsilon^2 + b_i^T D^{-1} b_i \tag{9.32}$$

and

$$f_i(\beta, b_i, x_i) = [f(x_{i1}\beta + z_{i1} b_i, t_{i1}), \ldots, f(x_{in_i}\beta + z_{in_i} b_i, t_{in_i})]^T.$$

The Laplace approximation (Sect. 3.7.2) is a second-order Taylor series expansion
of $g(\cdot)$ about

$$\hat{b}_i = \arg\min_{b_i} [-g(b_i)],$$

where this minimization constitutes a penalized least squares problem. For a
nonlinear model, numerical methods are required for this minimization, but the
dimensionality, $q + 1$, is typically small. With respect to (9.31), the second difficulty
is how to maximize the likelihood as a function of $\beta$ and $\alpha$; again see Pinheiro and
Bates (2000) and Bates (2011) for details.

In terms of the random effects, empirical Bayes estimates may be calculated,
as with the GLMM. In the example that follows, we evaluate the MLEs using
the procedure described in Lindstrom and Bates (1990) in which estimates of $b_i$
and $\beta$ are first obtained by minimizing the penalized least squares criterion (9.32),
given estimates of $D$ and $\sigma_\epsilon^2$. Then a first-order Taylor series expansion of $f_i$
about the current estimates of $\beta$ and $b_i$ is carried out, which results in a LMM.
For such a model, the random effects may be integrated out analytically, and the
subsequent (approximate) likelihood can be maximized with respect to $D$ and $\sigma_\epsilon^2$.
This procedure is then iterated until convergence.
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Approximate  inference  for  [/3,  a]  is  carried  out  via  asymptotic  normality  of  the 
MLE: 


where  Ipp,  Ipa,  Iap ,  and  Iaa  are  the  relevant  information  matrices. 

Many  approximation  strategies  have  been  suggested  for  nonlinear  hierarchical 
models,  but  care  is  required  since  validity  of  the  asymptotic  distribution  depends 
on  the  approximation  used.  For  example,  a  historically  popular  approach  (Beal  and 
Sheiner  1982)  was  to  carry  out  a  first-order  Taylor  series  about  E[6j]  =  0  to  give 


N 


Ipp  I Pa 

IaP  Iaa 


Vij  —  f  {‘Tij  ft  i  T  Zjj  bf ,  t'lj  )  Cij 

~  f(xzjftiltij)  +  bIi 

This  first-order  estimator  is  inconsistent,  however,  and  has  bias  even  if  n,  and  m 
both  go  to  infinity;  see  Demidenko  (2004,  Chap.  8). 
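For contrast with the expansion about $b_i = 0$, the Lindstrom and Bates step expands $f_i$ about the current estimates of both $\beta$ and $b_i$. A sketch of the resulting working LMM follows; the symbols $\widehat{X}_i$, $\widehat{Z}_i$, and $\widetilde{y}_i$ are introduced here for illustration and are not notation from the text.

```latex
\begin{align*}
f_i(\beta, b_i) &\approx f_i(\widehat{\beta}, \widehat{b}_i)
   + \widehat{X}_i (\beta - \widehat{\beta})
   + \widehat{Z}_i (b_i - \widehat{b}_i), \\
\widehat{X}_i &= \left. \frac{\partial f_i}{\partial \beta^T} \right|_{\widehat{\beta}, \widehat{b}_i},
\qquad
\widehat{Z}_i = \left. \frac{\partial f_i}{\partial b_i^T} \right|_{\widehat{\beta}, \widehat{b}_i}, \\
\widetilde{y}_i &= y_i - f_i(\widehat{\beta}, \widehat{b}_i)
   + \widehat{X}_i \widehat{\beta} + \widehat{Z}_i \widehat{b}_i
   \approx \widehat{X}_i \beta + \widehat{Z}_i b_i + \epsilon_i, \\
\widetilde{y}_i \mid \beta, \alpha &\;\dot\sim\;
   N\left( \widehat{X}_i \beta, \; \widehat{Z}_i D \widehat{Z}_i^T + \sigma_\epsilon^2 I_{n_i} \right).
\end{align*}
```

The last line is the approximate marginal model whose likelihood is maximized over $D$ and $\sigma_\epsilon^2$ before the expansion point is updated.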


Example:  Pharmacokinetics  of  Theophylline 

For these data, the one-compartment model with first-order absorption and elimination
is a good starting point for analysis. This model was described in some detail
in Sect. 6.16.3. The mean concentration at time point $t_{ij}$ for subject $i$ is

\[ \frac{D_i k_{ai} k_{ei}}{Cl_i (k_{ai} - k_{ei})} \left[ \exp(-k_{ei} t_{ij}) - \exp(-k_{ai} t_{ij}) \right], \tag{9.33} \]

where we have parameterized in terms of $[Cl_i, k_{ai}, k_{ei}]$ and $D_i$ is the initial dose.
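As a minimal sketch of how (9.33) might be evaluated, the function below computes the mean concentration curve. The rate and clearance values are obtained by exponentiating the likelihood estimates reported in Table 9.13; the dose and the time grid are hypothetical and chosen only for illustration.

```python
import math

def mean_concentration(t, dose, ke, ka, cl):
    """Mean concentration (9.33) for the one-compartment model; needs ka != ke."""
    scale = dose * ka * ke / (cl * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))

# Exponentiated likelihood estimates from Table 9.13.
ke = math.exp(-2.43)
ka = math.exp(0.45)
cl = math.exp(-3.21)
dose = 4.5  # hypothetical dose

for t in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0]:
    print(t, round(mean_concentration(t, dose, ke, ka, cl), 3))
```

The curve starts at zero, rises while absorption dominates, and then decays at a rate governed by $k_e$.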

We  first  fit  the  above  model  to  each  individual,  using  nonlinear  least  squares; 
Fig.  9.9  gives  the  resultant  95%  asymptotic  confidence  intervals.  The  between- 
individual  variability  is  evident,  particularly  for  log  ka.  Figure  9.10  displays  the 
data  along  with  the  fitted  curves.  The  general  shape  of  the  curve  seems  reasonable, 
but  the  peak  is  missed  for  a  number  of  individuals  (e.g.,  numbers  10,  1,  5,  and  9). 

Turning now to a NLMM, we assume that each of the parameters is treated as a
random effect so that

\[ \log k_{ei} = \beta_1 + b_{1i} \tag{9.34} \]

\[ \log k_{ai} = \beta_2 + b_{2i} \tag{9.35} \]

\[ \log Cl_i = \beta_3 + b_{3i} \tag{9.36} \]

with $b_i \mid D \sim N_3(0, D)$ where $b_i = [b_{1i}, b_{2i}, b_{3i}]^T$. The estimates resulting from
the Lindstrom and Bates (1990) method described in the previous section are given
in Table 9.13. The standard deviation of the random effects for $\log k_a$ is large, as we
anticipated from examination of Fig. 9.9.

Fig. 9.9 95% confidence intervals for each of the three parameters and 12 individuals
in the theophylline data. Obtained via individual fitting



Fig.  9.10  Concentrations  versus  time  for  12  individuals  given  the  drug  theophylline,  along  with 
individual  nonlinear  least  squares  fits 


9.18  Bayesian  Inference  for  the  Nonlinear  Mixed  Model 

The  first  two  stages  of  the  model  are  as  in  the  likelihood  formulation.  We  first 
discuss  how  hyperpriors  may  be  specified,  before  discussing  inference  for  functions 
of  interest. 


9.18.1  Hyperpriors 

A Bayesian approach requires a prior distribution for $\beta$, $\alpha$. As with the LMM,
a proper prior is required for the matrix $D$. In contrast to the LMM, a proper
prior is required for $\beta$ also, to ensure the propriety of the posterior distribution.
If parameters occur linearly, then proper priors are not required, but, as usual, the
safest strategy is to specify proper priors.

Table 9.13 Comparison of likelihood and Bayesian NLMM estimation techniques for the
theophylline data

PK label   Parameter          Likelihood        Bayes normal      Bayes lognorm     Bayes power
                              Est.    (s.e.)    Est.    (s.d.)    Est.    (s.d.)    Est.    (s.d.)
log ke     $\beta_1$          -2.43   (0.063)   -2.46   (0.077)   -2.43   (0.075)   -2.25   (0.083)
log ka     $\beta_2$           0.45   (0.20)     0.47   (0.19)     0.26   (0.23)     0.45   (0.22)
log Cl     $\beta_3$          -3.21   (0.081)   -3.23   (0.082)   -3.22   (0.090)   -3.22   (0.092)
log ke     $\sqrt{D_{11}}$     0.13   (-)        0.19   (0.049)    0.22   (0.059)    0.23   (0.061)
log ka     $\sqrt{D_{22}}$     0.64   (-)        0.62   (0.15)     0.72   (0.19)     0.69   (0.18)
log Cl     $\sqrt{D_{33}}$     0.25   (-)        0.25   (0.051)    0.30   (0.071)    0.29   (0.072)

For the likelihood summaries, we report the MLEs and the asymptotic standard errors, while for
the Bayesian analysis, we report the mean and standard deviation of the posterior distribution. The
three Bayesian models differ in the error models assumed at the first stage with normal, lognormal,
and power models being considered

For simplicity, we assume that random effects are associated with all parameters
and, as in Sect. 8.6.3, parameterize the model as $\tau = \sigma_\epsilon^{-2}$, $W = D^{-1}$, and
$\beta_i = \beta + b_i$ for $i = 1, \ldots, m$, with the dimensionality of $\beta_i$ being $k + 1$. The joint
posterior is

\[ p(\beta_1, \ldots, \beta_m, \tau, \beta, W \mid y) \propto \prod_{i=1}^{m} \left[ p(y_i \mid \beta_i, \tau) \, p(\beta_i \mid \beta, W) \right] \pi(\beta) \, \pi(\tau) \, \pi(W). \]

We assume the priors

\[ \beta \sim N_{k+1}(\beta_0, V_0), \quad \tau \sim \mathrm{Ga}(a_0, b_0), \quad W \sim \mathrm{Wish}_{k+1}(r, R^{-1}); \]

for further discussion of this specification, see Sect. 8.6.2. Closed-form inference
is unavailable, but MCMC is almost as straightforward as in the LMM case. The
INLA approach is not (at the time of writing) available for the Bayesian analysis of
nonlinear models. With respect to MCMC, the conditional distributions for $\beta$, $\tau$, $W$
are unchanged from the linear case. There is no closed-form conditional distribution
for $\beta_i$, which is given by

\[ p(\beta_i \mid \beta, \tau, W, y) \propto p(y_i \mid \beta_i, \tau) \times p(\beta_i \mid \beta, W), \]

but a Metropolis-Hastings step can be used (to give a Metropolis within Gibbs
algorithm, as described in Sect. 3.8.5).


9.18.2  Inference for Functions of Interest

We discuss prior choice and inferential summaries in the context of fitting a NLMM
to the theophylline data. For these data, the parameterization

\[ \beta_i = [\log k_{ei}, \log k_{ai}, \log Cl_i] \]

was initially adopted, with random effects normal prior $\beta_i \mid \beta, D \sim_{iid} N_3(\beta, D)$.
We assume independent normal priors for the elements of $\beta$, centered at 0 and
with large variances (recall that we need proper priors). For $D^{-1}$, we assume a
Wishart$(r, R^{-1})$ distribution with diagonal $R$ (see Sect. 8.6.2 and Appendix D for
discussion of the Wishart distribution). We describe the procedure that is followed
in order to choose the diagonal elements.

Consider a generic univariate “natural” parameter $\theta$ (e.g., $k_e$, $k_a$, or $Cl$) for which
we assume the lognormal prior LogNorm$(\beta, D)$. Pharmacokineticists have insight
into the coefficient of variation for $\theta$, that is, $\mathrm{CV}(\theta) = \mathrm{sd}(\theta)/E[\theta]$. Recall the first
two moments of a lognormal:

\[ \begin{aligned}
E[\theta] &= \exp(\beta + D/2) \\
\mathrm{var}(\theta) &= E[\theta]^2 [\exp(D) - 1] \\
\mathrm{sd}(\theta) &= E[\theta] \sqrt{\exp(D) - 1} \\
&\approx E[\theta] \sqrt{D}
\end{aligned} \]

so that

\[ \mathrm{CV}(\theta) \approx \sqrt{D}. \]

We can therefore assign a prior for $D$ by providing a prior estimate of $\sqrt{D}$. Under
the Wishart parameterization we have adopted, $E[D^{-1}] = r R^{-1}$. We take $r = 3$
(which is the smallest integer that gives a proper prior) and $R = \mathrm{diag}(1/5, 1/5, 1/5)$,
which gives $E[D_{kk}^{-1}] = 15$ so that, for $k = 1, 2, 3$, $E[\sqrt{D_{kk}}] \approx 1/\sqrt{15} = 0.26$, or
an approximate prior expectation of the coefficient of variation of 26%, which is
reasonable in this context (Wakefield et al. 1999).
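The quality of the approximation $\mathrm{CV}(\theta) \approx \sqrt{D}$ at this prior setting can be checked directly. The short sketch below compares the exact lognormal coefficient of variation, $\sqrt{\exp(D) - 1}$, with $\sqrt{D}$ at $D = 1/15$; treating $1/15$ as a plug-in value for $D_{kk}$ is itself an approximation, as in the text.

```python
import math

def lognormal_cv(D):
    """Exact coefficient of variation of a LogNorm(beta, D) random variable."""
    return math.sqrt(math.exp(D) - 1.0)

D_prior = 1.0 / 15.0           # reciprocal of the prior mean of D_kk^{-1}
exact = lognormal_cv(D_prior)
approx = math.sqrt(D_prior)

print(round(exact, 3), round(approx, 3))  # prints 0.263 0.258
```

The two values agree to within about half a percentage point, so quoting a prior coefficient of variation of roughly 26% is reasonable on either scale.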

For inference, again consider a generic parameter $\theta$ with prior LogNorm$(\beta, D)$.
The mode, median, and mean of the population distribution of $\theta$ are

\[ \exp(\beta - D), \quad \exp(\beta), \quad \exp(\beta + D/2), \]

respectively. Further, $\exp(\beta \pm 1.96\sqrt{D})$ is a 95% interval for $\theta$ in the population.
Consequently, given samples from the posterior $p(\beta, D \mid y)$, one may simply
convert to samples for any of these summaries.
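The conversion from posterior samples to population summaries might be sketched as follows. No MCMC output is available here, so the list `draws` is a synthetic stand-in for samples of $(\beta, D)$; the function name `population_summaries` and all numerical settings are assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)

# Synthetic stand-in for MCMC output: hypothetical draws of (beta, D).
# In practice these pairs would be read from the sampler.
draws = [(random.gauss(-2.43, 0.08), random.uniform(0.02, 0.06))
         for _ in range(2000)]

def population_summaries(beta, D):
    """Mode, median, mean, and 95% interval of a LogNorm(beta, D) population."""
    half_width = 1.96 * math.sqrt(D)
    return {
        "mode": math.exp(beta - D),
        "median": math.exp(beta),
        "mean": math.exp(beta + D / 2.0),
        "lower": math.exp(beta - half_width),
        "upper": math.exp(beta + half_width),
    }

# One set of summaries per posterior draw.
converted = [population_summaries(b, D) for b, D in draws]

# Posterior median of each population summary.
posterior_median = {
    name: statistics.median(s[name] for s in converted)
    for name in converted[0]
}
print(posterior_median)
```

Because each summary is computed draw by draw, posterior uncertainty in both $\beta$ and $D$ carries through to the summary automatically.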

In a pharmacokinetic context, interest often focuses on various functions of the
natural parameters. As a first example, consider the terminal half-life, which is given
by $t_{1/2} = k_e^{-1} \log 2$. In the parameterization adopted in the theophylline study,
$\log k_e \sim N(\beta_1, D_{11})$, and so the distribution of the log half-life is normal also:

\[ \log t_{1/2} \sim N[\log(\log 2) - \beta_1, D_{11}], \]

which simplifies inference since one can summarize the population distribution in
the same way as was just described for a generic parameter $\theta$. Other parameters
of interest are not simple linear combinations, however. For example, the time to
maximum is

\[ t_{\max} = \frac{1}{k_a - k_e} \log\left( \frac{k_a}{k_e} \right), \]

and the maximum concentration is

\[ c_{\max} = \frac{\mathrm{dose} \times k_e}{Cl} \left( \frac{k_a}{k_e} \right)^{-k_e/(k_a - k_e)}, \]

both of which follow from maximizing (9.33) over time. For such summaries, the
population distribution may be examined by simulating
parameter sets $[\log k_e, \log k_a, \log Cl]$ for new individuals from the population
distribution, and then converting to the functions of interest.
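A minimal sketch of this simulation is given below. It plugs in the likelihood estimates from Table 9.13 as fixed population values, ignores the off-diagonal elements of $D$ (an assumption made to keep the example dependency-free), and uses a hypothetical dose. In a Bayesian analysis the same conversion would be repeated for each posterior draw of $(\beta, D)$.

```python
import math
import random
import statistics

random.seed(2)

beta = [-2.43, 0.45, -3.21]   # population means of log ke, log ka, log Cl
sd = [0.13, 0.64, 0.25]       # square roots of D11, D22, D33
dose = 4.5                    # hypothetical dose

def t_max(ke, ka):
    """Time at which the one-compartment mean curve (9.33) peaks."""
    return math.log(ka / ke) / (ka - ke)

def c_max(ke, ka, cl):
    """Mean concentration (9.33) evaluated at the time of the peak."""
    tm = t_max(ke, ka)
    scale = dose * ka * ke / (cl * (ka - ke))
    return scale * (math.exp(-ke * tm) - math.exp(-ka * tm))

tmax_draws, cmax_draws = [], []
for _ in range(5000):
    # Assumption: random effects are drawn independently (diagonal D).
    ke, ka, cl = (math.exp(random.gauss(b, s)) for b, s in zip(beta, sd))
    tmax_draws.append(t_max(ke, ka))
    cmax_draws.append(c_max(ke, ka, cl))

print(statistics.median(tmax_draws), statistics.median(cmax_draws))
```

Quantiles of `tmax_draws` and `cmax_draws` then describe the population distributions of these two derived quantities.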

As noted in Sect. 9.16, the parameterization $[\log k_e, \log k_a, \log Cl]$ that we have
adopted is non-identifiable since the same likelihood values are achieved with the
set $[\log k_a, \log k_e, \log Cl]$. For the theophylline data, we performed MCMC with
two chains, and one of the chains “flipped” between the two non-identifiable regions
in the parameter space, as illustrated in Fig. 9.11 (note that in panels (a) and (b), the
vertical axes have the same scale). In this plot the three population parameters $\beta_1$,
$\beta_2$, $\beta_3$ are plotted in the three rows. Here, the labeling of $\beta_1$ and $\beta_2$ is arbitrary. The
parameter $\beta_3$ is unaffected by the flip-flop behavior because the mean log clearance
is the same under each non-identifiable set. In Fig. 9.11(a), the chain represented by
the solid line corresponds to the smaller of the two rate constants and, after a small
period of burn-in, remains in the region of the parameter space corresponding to
the smaller constant. In contrast, the chain represented by the dotted line flips to the
region corresponding to the larger rate constant at around (thinned) iteration number
200. In panel (b), we see that the dotted chain flips the other way, as it is required
to do.
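The source of the flip-flop behavior is that the mean function (9.33) is unchanged when $k_e$ and $k_a$ are exchanged. The sketch below checks this numerically; all parameter values are hypothetical.

```python
import math

def mean_concentration(t, dose, ke, ka, cl):
    """One-compartment mean concentration (9.33); needs ka != ke."""
    scale = dose * ka * ke / (cl * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))

dose, cl = 4.5, 0.04   # hypothetical dose and clearance
ke, ka = 0.09, 1.6     # hypothetical rate constants

results = []
for t in [0.5, 2.0, 8.0, 24.0]:
    original = mean_concentration(t, dose, ke, ka, cl)
    swapped = mean_concentration(t, dose, ka, ke, cl)  # ke and ka exchanged
    results.append((original, swapped))
    print(t, original, swapped)
```

Exchanging the two rates flips the sign of both the leading factor and the bracketed difference, so the product, and hence the likelihood, is identical in the two regions of the parameter space.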

We now constrain the parameters by enforcing the known ordering on the rates:
$k_{ai} > k_{ei} > 0$. To avoid the flip-flop problem, we use the parameterization

\[ \theta_{1i} = \log k_{ei} = \beta_1 + b_{1i} \tag{9.37} \]

\[ \theta_{2i} = \log(k_{ai} - k_{ei}) = \beta_2 + b_{2i} \tag{9.38} \]

\[ \theta_{3i} = \log Cl_i = \beta_3 + b_{3i} \tag{9.39} \]

with $b_i = [b_{1i}, b_{2i}, b_{3i}]^T \sim N_3(0, D)$. This is a different model from the model that
does not prevent flip-flop since the prior inputs are different. In this case, we keep
the same priors, which corresponds to assuming that the coefficient of variation for
$k_a - k_e$ is around 26%. This is clearly less meaningful, but in this example, $k_a$ is
considerably larger than $k_e$.

Fig. 9.11 Demonstration of flip-flop behavior for the theophylline data and the
unconstrained parameterization given by (9.34)-(9.36): (a) $\beta_1$, (b) $\beta_2$,
(c) $\beta_3$. Thinned realizations from two chains appear in each plot
(horizontal axis: thinned iteration number)

We can convert to the original parameters via

\[ k_{ei} = \exp(\theta_{1i}) \]

\[ k_{ai} = \exp(\theta_{1i}) + \exp(\theta_{2i}) \]

\[ Cl_i = \exp(\theta_{3i}). \]

Inference for the population distribution of $k_{ei}$ and $Cl_i$ is straightforward, but for
$k_{ai}$, more work is required. However, the expectation of the population absorption
rate is

\[ E[k_{ai}] = E[\exp(\theta_{1i}) + \exp(\theta_{2i})] = \exp(\beta_1 + D_{11}/2) + \exp(\beta_2 + D_{22}/2). \]

A full Bayesian analysis is postponed until later in the chapter (at the end of
Sect. 9.20).
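The closed-form expectation of $k_{ai}$ can be checked against simulation. In the sketch below the population values are hypothetical (loosely based on the lognormal column of Table 9.13), and the two random effects are simulated independently for simplicity; the expectation itself does not depend on their correlation.

```python
import math
import random

random.seed(3)

beta1, beta2 = -2.43, 0.26          # hypothetical means of theta_1 and theta_2
D11, D22 = 0.22 ** 2, 0.72 ** 2     # hypothetical variances

# Sum of two lognormal means.
closed_form = math.exp(beta1 + D11 / 2.0) + math.exp(beta2 + D22 / 2.0)

n = 200000
total = 0.0
min_gap = float("inf")
for _ in range(n):
    theta1 = random.gauss(beta1, math.sqrt(D11))
    theta2 = random.gauss(beta2, math.sqrt(D22))
    ke = math.exp(theta1)
    ka = ke + math.exp(theta2)      # conversion back to the absorption rate
    total += ka
    min_gap = min(min_gap, ka - ke)

monte_carlo = total / n
print(closed_form, monte_carlo)
```

By construction every simulated individual satisfies $k_{ai} > k_{ei}$, which is the ordering that the parameterization (9.37)-(9.39) is designed to enforce.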


9.19  Generalized  Estimating  Equations 

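If interest lies in population parameters, then we may use the estimator $\widehat{\gamma}$ that
satisfies

\[ G(\gamma, \widehat{\alpha}) = \sum_{i=1}^{m} D_i^T W_i^{-1} (Y_i - f_i) = 0, \tag{9.40} \]

where $D_i = \partial f_i / \partial \gamma$, $W_i = W_i(\gamma, \alpha)$ is the working covariance model, $f_i = f_i(\gamma)$,
and $\widehat{\alpha}$ is a consistent estimator of $\alpha$. Sandwich estimation may be used to
obtain an empirical estimate of the variance $V_\gamma$:

\[ \widehat{V}_\gamma = \left( \sum_{i=1}^{m} D_i^T W_i^{-1} D_i \right)^{-1} \left[ \sum_{i=1}^{m} D_i^T W_i^{-1} \mathrm{cov}(Y_i) W_i^{-1} D_i \right] \left( \sum_{i=1}^{m} D_i^T W_i^{-1} D_i \right)^{-1}. \tag{9.41} \]

We then have the usual asymptotic result: $\widehat{V}_\gamma^{-1/2} (\widehat{\gamma} - \gamma) \rightarrow_d N(0, I)$.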

GEE has not been extensively used in a nonlinear (non-GLM) setting. This
is partly because in many settings (e.g., pharmacokinetics/pharmacodynamics),
interest focuses on understanding between-individual variability, and explaining
this in terms of individual-specific covariates, or making predictions for particular
individuals. The interpretation of the parameters within a GEE implementation is
also not straightforward. For a marginal GLM, there is a link function and a linear
predictor which allows interpretation in terms of differences in averages between
populations defined by covariates; see Sect. 9.11. Consider a nonlinear model over
time. In a mixed model, the population mean parameters are averages of individual-level
parameters. A marginal approach models the average response as a nonlinear
function of time, and the parameters do not, in general, have interpretations as
averages of parameters. Rather, parameters within a marginal nonlinear model
determine a population-averaged curve. The parameters can be made a function of
covariates such as age and gender, but the interpretation is less clear when compared
to a mixed model formulation. For example, in (9.29), we model the individual-level
log clearance as a function of a covariate $x_i$. We could include covariates
in the marginal model in an analogous fashion, but it is not individual clearance
we are modeling, and the subsequent analysis cannot be used in the same way to
derive optimal doses as a function of $x$, for example. Obviously, GEE cannot provide
estimates of between-individual variability or obtain predictions for individuals.

Table 9.14 GEE estimates of marginal parameters for the theophylline data

PK label   Parameter     Est.    (s.e.)
log ke     $\gamma_1$    -2.52   (0.068)
log ka     $\gamma_2$     0.40   (0.17)
log Cl     $\gamma_3$    -3.25   (0.076)


Example:  Pharmacokinetics  of  Theophylline 


GEE was implemented with mean model

\[ E[Y_{ij}] = f_{ij}(\gamma) = \frac{D_i \exp(\gamma_1 + \gamma_2)}{\exp(\gamma_3) [\exp(\gamma_2) - \exp(\gamma_1)]} \left[ \exp(-e^{\gamma_1} t_{ij}) - \exp(-e^{\gamma_2} t_{ij}) \right]. \tag{9.42} \]

As just discussed, the interpretation of the parameters for this model is not
straightforward since we are simply modeling a population-averaged curve. So,
for example, $k_e = \exp(\gamma_1)$ is the rate of elimination that defines the population-averaged
curve and is not the average elimination rate in the population.

We use working independence ($W_i = I_{n_i}$) so that (9.40) is equivalent to a
nonlinear least squares criterion, which allows the estimates to be found using standard
software. The variance estimate (9.41) simplifies under working independence,
and the most tedious part is evaluating the $n_i \times 3$ matrix of partial derivatives
$D_i = \partial f_i / \partial \gamma$. The estimates and standard errors are given in Table 9.14. It is
not possible to directly compare these estimates with those obtained from a mixed
model formulation.
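To make the working-independence calculation concrete, here is a minimal sketch for a deliberately simplified setting: a single marginal parameter $\gamma$, a hypothetical mean curve $f(t) = \exp(-e^{\gamma} t)$, and simulated clustered data. It is not the three-parameter theophylline analysis. With one parameter, $D_i$ is a vector and the sandwich form (9.41) reduces to scalars, with $\mathrm{cov}(Y_i)$ replaced by the outer product of unit $i$'s residuals.

```python
import math
import random

random.seed(4)
times = [0.5, 1.0, 2.0, 4.0, 8.0]

# Simulated clustered data (hypothetical): each unit has its own decay rate.
data = []
for _ in range(12):
    k_i = 0.5 * math.exp(random.gauss(0.0, 0.2))
    data.append([math.exp(-k_i * t) + random.gauss(0.0, 0.02) for t in times])

def mean_and_derivative(gamma):
    """Marginal mean f(t) = exp(-e^gamma t) and df/dgamma at each time."""
    k = math.exp(gamma)
    f = [math.exp(-k * t) for t in times]
    d = [-k * t * fj for t, fj in zip(times, f)]
    return f, d

def score_and_bread(gamma):
    """Estimating function G(gamma) and sum of D_i^T D_i under W_i = I."""
    f, d = mean_and_derivative(gamma)
    score = sum(dj * (yj - fj) for y in data for dj, yj, fj in zip(d, y, f))
    bread = len(data) * sum(dj * dj for dj in d)
    return score, bread

# Gauss-Newton iterations to solve G(gamma) = 0 (nonlinear least squares).
gamma = math.log(0.5)
for _ in range(100):
    score, bread = score_and_bread(gamma)
    step = score / bread
    gamma += step
    if abs(step) < 1e-12:
        break

# Sandwich variance: bread^{-1} * meat * bread^{-1}.
f, d = mean_and_derivative(gamma)
meat = sum(sum(dj * (yj - fj) for dj, yj, fj in zip(d, y, f)) ** 2
           for y in data)
_, bread = score_and_bread(gamma)
sandwich_se = math.sqrt(meat) / bread

print(gamma, sandwich_se)
```

The estimate describes the population-averaged curve; the between-unit variability in the simulated rates affects the standard error through the meat term but is not itself estimated.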


9.20  Assessment of Assumptions for General Regression Models

Model  checking  proceeds  as  with  the  linear  model  with  dependent  data  (Sect.  8.8) 
except  that  interpretation  is  not  as  straightforward  since  the  properties  of  residuals 
are  difficult  to  determine  even  when  the  model  is  correct.  We  focus  on  generalized 
and nonlinear mixed models. For both of these classes Pearson (stage one) residuals,

\[ e_{ij} = \frac{Y_{ij} - E[Y_{ij} \mid b_i]}{\sqrt{\mathrm{var}(Y_{ij} \mid b_i)}}, \]

are straightforward to calculate.

With  respect  to  mixed  models,  as  with  the  LMM,  there  are  assumptions  at  each  of 
the  stages,  and  one  should  endeavor  to  provide  checks  at  each  stage.  If  we  are  in  the 
situation  in  which  there  are  individuals  with  sufficient  data  to  reliably  estimate  the 
parameters  from  these  data  alone,  we  should  use  the  resultant  estimates  to  provide 
checks.  Residuals  from  individual  fits  can  be  used  to  assess  whether  the  nonlinear 
model  is  appropriate  and  if  the  assumed  variance  model  is  appropriate.  One  may 
also  construct  normal  QQ  plots  and  bivariate  plots  of  the  estimated  individual-level 
parameters to see if the second-stage normality assumption appears reasonable. In
a nonlinear setting, there are few results available on the consistency of estimates
unless the model is correct, and so it is far more important to have random effects
distributions that are approximately correctly specified.

If  individual-level  covariates  are  available,  then  the  estimated  parameters  may 
be  plotted  against  these  to  determine  whether  a  second-stage  regression  model  is 
appropriate  (if  we  are  in  exploratory  mode).  In  the  pharmacokinetic  context,  one 
may  model  clearance  as  a  function  of  weight,  for  example,  via  a  loglinear  model 
as  in  (9.29).  Examining  whether  the  spread  of  the  random  effects  estimates  changes 
with  covariates  is  also  an  important  step. 

All  of  the  above  checks  can  be  carried  out  based  on  the  (shrunken)  estimates 
obtained  from  random  effects  modeling,  but  caution  is  required  as  these  estimates 
may be strongly influenced by the assumption of normality. If $n_i$ is large, then this
will be less problematic.


Example:  Pharmacokinetics  of  Theophylline 

We present some diagnostics for the theophylline data. We first carry out individual
fitting using nonlinear least squares (which is possible here since $n_i = 11$), and
Fig. 9.12 gives normal QQ plots of the $\log k_e$, $\log k_a$, and $\log Cl$ parameters. There
is at least one outlying individual here, but there is nothing too worrying in these
plots.

Fig. 9.12 Normal QQ plots (left column) and scatterplots (right column) of the parameter
estimates from individual nonlinear least squares fits for the theophylline data: (a) QQ plot for
log ke, (b) log ka versus log ke, (c) QQ plot for log ka, (d) log Cl versus log ke, (e) QQ plot for
log Cl, (f) log Cl versus log ka


In  the  following,  a  number  of  mixed  models  are  fitted  in  an  exploratory  fashion 
in  order  to  demonstrate  some  of  the  flexibility  of  NLMMs.  We  first  fit  a  mixed 
model  using  MLE  and  the  nonlinear  form  (9.33).  The  error  terms  were  assumed  to 
be  normal  on  the  concentration  scale,  with  constant  variance.  Plots  of  the  Pearson 
residuals  versus  fitted  value  and  versus  time  are  displayed  in  Figs.  9.13(a)  and  (b). 

Figure 9.13(b) suggests the variance changes with time (or that the model is
inadequate for time points close to 0), and we carry out another analysis with the
model

\[ \log y_{ij} = \log(\mu_{ij}) + \delta_{ij} \tag{9.43} \]

with $\mu_{ij}$ again given by (9.33) and $\delta_{ij} \mid \sigma_\delta^2 \sim_{iid} N(0, \sigma_\delta^2)$. This lognormal
model has (approximately) a constant coefficient of variation. To fit this model,
the responses at time 0 were removed since $\mu_{ij} = 0$ for $t_{ij} = 0$. This time, we
adopt the parameterization that prevents flip-flop, that is, the model with (9.37)-(9.39).
This model produced the Bayesian summaries given in Table 9.13, which are
reasonably consistent with those in the normal model. Unfortunately, the residual
plot in Fig. 9.13(d) shows only a slight improvement over the normal model in
panel (b).

Fig. 9.13 Residuals obtained from various NLMM fits to the theophylline data: (a) normal model:
residuals against fitted values, (b) normal model: residuals against time, (c) lognormal model:
residuals against fitted values, (d) lognormal model: residuals against time, (e) power model:
residuals against fitted values, (f) power model: residuals against time

The next model considered was $y_{ij} = \mu_{ij} + \epsilon_{ij}$ with the power model

\[ \epsilon_{ij} \mid \mu_{ij}, \sigma_0, \sigma_1, \gamma \sim_{ind} N(0, \sigma_0^2 + \sigma_1^2 \mu_{ij}^{\gamma}) \tag{9.44} \]

with $\mu_{ij}$ given by (9.33) and $0 < \gamma < 2$. This model has two components of variance
and may be used when an assay method displays constant measurement error at low
concentrations, with the variance increasing with the mean for larger concentrations.
See Davidian and Giltinan (1995, Sect. 2.2.3) for further discussion of variance
models.

The joint prior on $[\sigma_0, \sigma_1, \gamma]$ can be difficult to specify since there is dependence
between $\sigma_1$ and $\gamma$ in particular. For simplicity, uniform priors on the range [0,2]
were placed on $\sigma_0$ and $\sigma_1$. The parameter $\gamma$ controls the strength of the mean-variance
relationship, and, considering the second component only, the constant
coefficient of variation model corresponds to $\gamma = 2$. In the pharmacokinetics
literature, fixing $\gamma = 1$ or 2 is not uncommon. A uniform prior on [0,2] was specified
for $\gamma$ also. Figures 9.13(e) and (f) show the residual plots for this model, and we
see some improvement over the other two error models, though there is still some
misspecification evident at low time points in panel (f). Further analyses for these
data might examine other absorption models (since the kinetics may be nonlinear,
which could explain the poor fit at low times).
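The two-component variance function in (9.44) and the corresponding Pearson residual can be sketched as follows. The function names and all numerical values are hypothetical, chosen only to show how the standard deviation behaves across the concentration range.

```python
import math

def power_sd(mu, sigma0, sigma1, gamma):
    """Standard deviation under the power model: sqrt(sigma0^2 + sigma1^2 mu^gamma)."""
    return math.sqrt(sigma0 ** 2 + sigma1 ** 2 * mu ** gamma)

def pearson_residual(y, mu, sigma0, sigma1, gamma):
    """Stage-one Pearson residual (y - mu) / sd for the power model."""
    return (y - mu) / power_sd(mu, sigma0, sigma1, gamma)

# Hypothetical parameter values; the ratio sd/mu is the coefficient of variation.
for mu in [0.1, 1.0, 5.0, 10.0]:
    sd = power_sd(mu, 0.3, 0.2, 0.71)
    print(mu, round(sd, 3), round(sd / mu, 3))
```

Setting `sigma0 = 0` and `gamma = 2` gives a standard deviation proportional to the mean, which is the constant coefficient of variation case mentioned above, while `gamma = 0` gives constant variance.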

Posterior summaries for the power variance model are given in Fig. 9.14. The
strong dependence between $\sigma_1$ and $\gamma$ is evident in panel (f). There is a reasonable
amount of uncertainty in the posterior for $\gamma$, but the median is 0.71. The parameter
estimates for $\beta$ and $D$ are given in Table 9.13 and are similar to those from
the normal and lognormal error models. Following the procedure described in
Sect. 9.18.2, samples for the population medians for $k_e$, $k_a$, and $Cl$ were generated,
and these are displayed in Fig. 9.15, with notable skewness in the posteriors for $k_a$
and $Cl$.


9.21  Concluding  Remarks 

The modeling of generalized linear and nonlinear dependent data is inherently more
difficult than the modeling of linear dependent data because of reduced mathematical
tractability, the computations required to perform inference, and the subtleties of
parameter interpretation. Conceptually, however, the adaptation of mixed (conditional)
and GEE (marginal) models to the generalized linear and nonlinear scenarios is
straightforward. With respect to parameter interpretation, the clear distinction between
marginal and conditional models is critical and needs to be recognized.

There is little theory on the consistency of estimators in the face of model
misspecification for GLMMs and NLMMs. This suggests that one should be more
cautious in interpretation of the results from GLMMs and NLMMs, when compared
to LMMs, and model checking should be carefully carried out. The effects of
model misspecification with mixed models have attracted a lot of interest. Heagerty
and Kurland (2001) illustrate the bias that is introduced when the random effects
variances are a function of covariates. McCulloch and Neuhaus (2011) show that
misspecification of the assumed random effects distribution has less impact on
prediction of random effects. Sensitivity analyses, with respect to the random effects
distribution, for example, can be useful. The Bayesian approach, with computation
via MCMC, is ideally suited to this endeavor. If the number of observations per
unit, or the number of units, is small, then the MCMC route is appealing because
one does not have to rely on asymptotic inference. Model checking is difficult in
this situation, however.

Fig. 9.14 Posterior summaries for the two-component power error model (9.44) fitted to the
theophylline data. Posterior marginals for $\sigma_0$, $\sigma_1$, $\gamma$ in the left column and bivariate plots in
the right column

Fig. 9.15 Posterior distributions from the power model (9.44) fitted to the theophylline data:
(a) population median ke, (b) population median ka versus population median ke, (c) population
median ka, (d) population median ke versus population median Cl, (e) population median Cl,
(f) population median Cl versus population median ka

We  have  not  discussed  REML  in  the  context  of  GLMs;  Smyth  and  Verbyla  (1996) 
show  how  REML  may  be  derived  from  a  conditional  likelihood  approach  in  the 
context  of  GLMs  with  dispersion  parameters  and  canonical  link  functions. 

The modeling of dependent binary data is a difficult enterprise since binary
observations contain little information, and there is no obvious choice of multivariate
binary distribution. Logistic mixed models are intuitively appealing but are
restrictive in the dependence structure they impose on the data. Care in computation
is required, and the use of adaptive Gauss-Hermite quadrature for MLE, or MCMC for
Bayes, is recommended. As always, GEE has desirable robustness properties for large
numbers of clusters. In the GLM context, we emphasize the fitting of both types
of model in a complementary fashion. We have illustrated how marginal inference
may  be  carried  out  with  the  logistic  mixed  model,  which  allows  direct  comparison 
of  results  with  GEE. 


9.22  Bibliographic  Notes 

Liang and Zeger (1986) and Zeger and Liang (1986) popularized GEE by considering
GLMs with dependence within units (in the context of longitudinal
data). Prentice (1988) proposed using a second set of estimating equations for $\alpha$.
Gourieroux  et  al.  (1984)  considered  the  quadratic  exponential  model.  Zhao  and 
Prentice  (1990)  discussed  the  use  of  this  model  for  multivariate  binary  data  and 
Prentice  and  Zhao  (1991)  for  general  responses  (to  give  the  approach  labelled 
GEE2).  Qaqish  and  Ivanova  (2006)  describe  an  algorithm  for  detecting  when  an 
arbitrary  set  of  logistic  contrasts  correspond  to  a  valid  set  of  joint  probabilities  and 
for  computing  them  if  they  provide  a  legal  set.  Fitzmaurice  et  al.  (2004)  is  a  very 
readable  account  of  the  modeling  of  longitudinal  data  with  GLMs,  from  a  frequentist 
(GEE  and  mixed  model)  perspective. 

An  extensive  treatment  of  Bayesian  multilevel  modeling  is  described  in  Gelman 
and  Hill  (2007).  We  have  concentrated  on  inverse  gamma  priors  for  random  effects 
variances,  but  a  popular  alternative  is  the  half-normal  prior;  see  Gelman  (2006)  for 
further  details.  Fong  et  al.  (2010)  describe  how  the  INLA  computational  approach 
may  be  used  for  GLMMs,  including  a  description  of  its  shortcomings,  in  terms  of 
accuracy,  for  the  analysis  of  binary  data.  Models  and  methods  of  analysis  for  spatial 
data  are  reviewed  in  Gelfand  et  al.  (2010). 

Davidian  and  Giltinan  (1995)  is  an  extensive  and  excellent  treatment  of  nonlinear 
modeling  with  dependent  responses,  mostly  from  a  non-Bayesian  perspective. 
Pinheiro  and  Bates  (2000)  is  also  excellent  and  covers  mixed  models  (again 
primarily from a likelihood perspective) and is particularly good on computation.


9.23  Exercises


9.1  Consider the model

\[ E[Y \mid b] = \frac{\exp(\beta x + b)}{1 + \exp(\beta x + b)}, \]

with $b \mid \sigma_0^2 \sim_{iid} N(0, \sigma_0^2)$. Prove that

\[ E[Y] \approx \frac{\exp[\beta x / (c^2 \sigma_0^2 + 1)^{1/2}]}{1 + \exp[\beta x / (c^2 \sigma_0^2 + 1)^{1/2}]}, \]

where $c = 16\sqrt{3}/(15\pi)$.

[Hint: $G(z) \approx \Phi(cz)$ where $G(z) = (1 + e^{-z})^{-1}$ is the CDF of a logistic
random variable, and $\Phi(\cdot)$ is the CDF of a normal random variable.]

9.2  Show  that  if  each  response  is  on  the  whole  real  line,  then  the  density  (9.19), 
with  Ci  =  0,  corresponds  to  the  multivariate  normal  model. 

9.3  With  respect  to  Table  9.11,  show  that,  for  a  model  for  two  binary  responses 
parameterized  in  terms  of  the  marginal  means  and  marginal  odds  ratio,  the 
likelihood  is  given  by  (9.26). 

9.4  Sommer  (1982)  contains  details  of  a  study  on  275  children  in  Indonesia. 
This  study  examined,  among  other  things,  the  association  between  the  risk 
of  respiratory  infection  and  xerophthalmia  (dry  eye  syndrome),  which  may 
be  caused  by  vitamin  A  deficiency.  These  data  are  available  in  the  R  package 
epicalc  and  are  named  Xerop. 

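Consider the marginal model for the jth observation on the ith child

$$\log\left(\frac{E[Y_{ij}]}{1 - E[Y_{ij}]}\right) = \gamma_0 + \gamma_1\,\mathrm{gender}_{ij} + \gamma_2\,\mathrm{hfora}_{ij} + \gamma_3\,\mathrm{cos}_{ij} + \gamma_4\,\mathrm{sin}_{ij} + \gamma_5\,\mathrm{xero}_{ij} + \gamma_6\,\mathrm{age}_{ij} + \gamma_7\,\mathrm{age}_{ij}^2$$ (9.45)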


where: 

•  $Y_{ij}$ is the absence/presence of respiratory infection.

•  $\mathrm{gender}_{ij}$ is the gender (0 = male, 1 = female).

•  $\mathrm{hfora}_{ij}$ is the height-for-age.

•  $\mathrm{cos}_{ij}$ is the cosine of time of measurement i, j (time is in number of
quarters).

•  $\mathrm{sin}_{ij}$ is the sine of time of measurement i, j (time is in number of quarters).

•  $\mathrm{xero}_{ij}$ is the absence/presence (0/1) of xerophthalmia.

•  $\mathrm{age}_{ij}$ is the age.

See  Example  9.4  of  Diggle  et  al.  (2002)  for  more  details  on  this  model. 

(a)  Interpret  each  of  the  coefficients  in  (9.45). 

(b)  Provide  parameter  estimates  and  standard  errors  from  a  GEE  analysis. 
(c)  Consider a GLMM logistic analysis with a normally distributed random
intercept  and  the  conditional  version  of  the  regression  model  (9.45). 
Interpret  the  coefficients  of  this  model. 

(d)  Provide  parameter  estimates  and  standard  errors  from  the  GLMM  analysis. 

(e)  Summarize the association between respiratory infection and xerophthalmia
and  age. 

9.5  On  the  book  website,  you  will  find  data  on  illiteracy  and  race  collected  during 
the  US  1930  census.  Wakefield  (2009b)  provides  more  information  on  these 
data.  Illiterate  is  defined  as  being  unable  to  read  and  over  10  years  of  age.  For 
each of the $i = 1, \dots, 49$ states that existed in 1930, the data consist of the
number of illiterate individuals $Y_{ij}$ and the total population aged 10 years and
older $N_{ij}$ by race, coded as native-born White ($j = 1$), foreign-born White
($j = 2$), and Black ($j = 3$). Let $p_{ij}$ be the probability of being illiterate
for an individual residing in state $i$ and of race $j$. An additional binary state-level
variable $x_i = 0/1$ describes whether Jim Crow laws were absent/present
in state $i = 1, \dots, 49$. These laws enforced racial segregation in all public
facilities.

The  association  between  illiteracy  and  race,  state,  and  Jim  Crow  laws  will 
be  examined  using  logistic  regression  models.  In  particular,  interest  focuses 
on  whether  illiteracy  in  1930  varied  by  race,  varied  across  states,  and  was 
associated  with  the  presence/absence  of  Jim  Crow  laws: 

(a)  Calculate the empirical logits of the $p_{ij}$'s, and provide informative plots
that  graphically  display  the  association  between  illiteracy  and  state,  race, 
and  Jim  Crow  laws. 

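(b)  First consider the native-born White data only, $(Y_{i1}, N_{i1})$, $i = 1, \dots, 49$,
with the following models:

•  Binomial: $Y_{i1} \mid p_{i1} \sim \mathrm{Binomial}(N_{i1}, p_{i1})$, with the logistic model

$$\log\left(\frac{p_{i1}}{1 - p_{i1}}\right) = \gamma_1$$ (9.46)

for $i = 1, \dots, 49$.

•  Quasi-Likelihood: Model (9.46) with

$$E[Y_{i1}] = N_{i1}p_{i1}, \quad \mathrm{var}(Y_{i1}) = \kappa \times N_{i1}p_{i1}(1 - p_{i1}).$$

•  GEE: Model (9.46) with $E[Y_{i1}] = N_{i1}p_{i1}$ and working independence.

•  GLMM:

$$\log\left(\frac{p_{i1}}{1 - p_{i1}}\right) = \beta_1 + b_{i1}$$ (9.47)

with $b_{i1} \mid \sigma_1^2 \sim_{iid} N(0, \sigma_1^2)$, $i = 1, \dots, 49$.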
(i)  Give careful definitions of $\exp(\gamma_1)$ in the GEE model and $\exp(\beta_1)$ in
the GLMM.

(ii)  Fit  the  binomial  model  to  the  native-born  White  data  and  give  a  95% 
confidence  interval  for  the  odds  of  native-born  White  illiteracy.  Is  this 
model  appropriate? 

(iii)  Fit the quasi-likelihood and GEE models to the native-born White data
and give a 95% confidence interval for the odds of native-born White
illiteracy in each case. How does the GEE approach differ from
quasi-likelihood here? Which do you prefer?

(iv)  Fit  the  GLMM  model  to  the  data  using  a  likelihood  approach  and  give 
a  95%  confidence  interval  for  the  odds  of  native-born  White  illiteracy 
along  with  an  estimate  of  the  between-state  variability  in  logits.  Are 
the  results  consistent  with  the  GEE  analysis? 

(c)  Now consider data on all three races. Using GEE, fit separate models to
the  data  of  each  race.  Give  a  95%  confidence  interval  for  the  odds  ratios 
comparing  illiteracy  between  foreign-born  Whites  and  native-born  Whites, 
and  comparing  Blacks  with  native-born  Whites.  Is  there  any  problem  with 
this  analysis? 

(d)  Use  GEE  to  fit  a  model  to  all  three  races  simultaneously  and  compare  your 
answer  with  the  previous  part.  Which  analysis  is  the  most  appropriate  and 
why? 

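(e)  Fit the GLMM

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_j + b_{ij}$$ (9.48)

with $b_{ij} \mid \sigma_j^2 \sim_{iid} N(0, \sigma_j^2)$, $j = 1, 2, 3$, using likelihood-based methods.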
Give  95%  confidence  intervals  for  the  odds  ratios  comparing  illiteracy 
between  foreign-born  Whites  and  native-born  Whites,  and  comparing 
Blacks  with  native-born  Whites.  Are  your  conclusions  the  same  as  with 
the  GEE  analysis?  Does  this  model  require  refinement? 

(f)  The state-level Jim Crow law indicator will now be added to the analysis.
Consider the model

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \gamma_{0j} + \gamma_{1j}x_i$$ (9.49)

Give interpretations of each of $\exp(\gamma_{0j})$, $\exp(\gamma_{1j})$ for $j = 1, 2, 3$. Fit this
model using GEE and interpret and summarize the results in a clear fashion.

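(g)  Consider Bayesian fitting of the GLMM:

$$\log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \beta_{0j} + \beta_{1j}x_i + b_{ij}$$ (9.50)

where $b_i \mid D \sim_{iid} N_3(0, D)$ with $b_i = [b_{i1}, b_{i2}, b_{i3}]^T$ and

$$D = \begin{bmatrix}
\sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\
\rho_{12}\sigma_2\sigma_1 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\
\rho_{13}\sigma_3\sigma_1 & \rho_{23}\sigma_3\sigma_2 & \sigma_3^2
\end{bmatrix}$$

is a 3 x 3 variance-covariance matrix for the random effects $b_i$. Assume
improper flat priors for $\beta_{0j}$, $\beta_{1j}$, $j = 1, 2, 3$, and the Wishart prior
$W = D^{-1} \sim \mathrm{Wishart}(r, S)$ parameterized so that $E[W] = rS$, with $r = 3$ and

$$S = \begin{bmatrix}
30.45 & 0 & 0 \\
0 & 30.45 & 0 \\
0 & 0 & 30.45
\end{bmatrix}.$$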

Carry  out  a  Bayesian  analysis  using  this  model  and  interpret  and  summarize 
the  results  in  a  clear  fashion. 

(h)  Write  a  short  summary  of  what  you  have  found,  concentrating  on  the 
particular  substantive  questions  of  interest  stated  in  the  introduction. 

9.6  For  the  theophylline  data  considered  in  this  chapter,  reproduce  the  results  in 
Table 9.14 by coding up the nonlinear GEE model with working independence.
These  data  are  available  as  Theoph  in  the  R  package. 

9.7  Throughout  this  chapter,  mixed  models  with  clustering  induced  by  normally 
distributed  random  effects  have  been  considered.  In  this  question,  a  non-normal 
random  effects  distribution  will  be  considered.  Suppose,  for  paired  binary 
observations,  that  the  data-generating  mechanism  is  the  following: 

$$Y_{ij} \mid \mu_{ij} \sim_{ind} \mathrm{Bernoulli}(\mu_{ij}),$$

for $i = 1, \dots, n$, $j = 1, 2$, with

$$\mu_{ij} = \frac{\exp(\beta_0 + \beta_1 X_{ij} + b_i)}{1 + \exp(\beta_0 + \beta_1 X_{ij} + b_i)},$$

$$b_i = \begin{cases} -\gamma & \text{with probability } 1/2 \\ \gamma & \text{with probability } 1/2, \end{cases}$$

and $X_{ij} \sim_{iid} \mathrm{Unif}(-10, 10)$. The parameters $\beta_1 \in \mathbb{R}$ and $\gamma > 0$ are unknown,
and all $b_i$ are independent and identically distributed. For simplicity, assume
$\beta_0 = 0$ throughout:

(a)  For $0 < \beta_1 < 1$ and $0 < \gamma < 5$, calculate the correlation between the
outcomes $Y_{i1}$ and $Y_{i2}$ within cluster $i$, averaged over the distribution of
clusters.
(b)  For $\beta_1 = 1$ and $0 < \gamma < 5$, calculate the numerical value of the true slope
parameter estimated by a GEE logistic regression model of y on x, with
working independence within clusters. Compare this value to the true $\beta_1$.

(c)  Consider  a  study  with  paired  observations  and  binary  outcomes  (e.g.,  a 
matched-pairs  case-control  study  as  described  in  Sect.  7.10.3).  The  true 
data-generating mechanism is as above with $\beta_1 = 1$, $\gamma = 5$. First plot y
versus  x  for  all  observations  and  add  a  smoother.  This  plot  seems  to  indicate 
that  there  are  low-,  medium-,  and  high-risk  subjects,  depending  on  levels  of 
x. 

(d)  In  truth,  of  course,  there  are  not  three  levels  of  risk.  For  some  example  data, 
give  a  plot  that  illustrates  this  and  write  an  explanation  of  what  your  plot 
shows.  The  plot  should  use  only  observed,  and  not  latent,  variables. 
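
For parts (c) and (d) of Exercise 9.7, example data must first be generated. The sketch below is one way to do so with the Python standard library. It assumes the two-point random effect takes the values -gamma and +gamma with probability 1/2 each; the function names, the tuple layout and the seed handling are hypothetical choices made for illustration.

```python
import math
import random


def simulate_pairs(n, beta1, gamma, beta0=0.0, seed=0):
    """Simulate n clusters of two binary outcomes; returns (i, j, x, y) tuples."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        # Assumed two-point random effect: +gamma or -gamma, each with probability 1/2.
        b = gamma if rng.random() < 0.5 else -gamma
        for j in (1, 2):
            x = rng.uniform(-10.0, 10.0)
            mu = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x + b)))
            y = 1 if rng.random() < mu else 0
            rows.append((i, j, x, y))
    return rows


def within_pair_correlation(rows):
    """Sample correlation between the first and second outcome of each pair."""
    pairs = {}
    for i, j, _, y in rows:
        pairs.setdefault(i, {})[j] = y
    y1 = [p[1] for p in pairs.values()]
    y2 = [p[2] for p in pairs.values()]
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    v1 = sum((a - m1) ** 2 for a in y1) / n
    v2 = sum((b - m2) ** 2 for b in y2) / n
    return cov / math.sqrt(v1 * v2)


if __name__ == "__main__":
    data = simulate_pairs(1000, beta1=1.0, gamma=5.0, seed=1)
    print(within_pair_correlation(data))
```

The simulated correlation is only an empirical counterpart to part (a); the exercise asks for the calculation itself.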


Part  IV 

Nonparametric  Modeling 


Chapter  10 

Preliminaries  for  Nonparametric  Regression 


10.1  Introduction 

In all other chapters we assume that the regression model, f(x), takes an a
priori  specified,  usually  simple,  parametric  form.  Such  models  have  a  number  of 
advantages:  If  the  assumed  parametric  form  is  approximately  correct,  then  efficient 
estimation  will  result;  having  a  specific  linear  or  nonlinear  form  allows  concise 
summarization  of  an  association;  inference  for  parametric  models  is  often  relatively 
straightforward.  Further,  a  particular  model  may  be  justifiable  from  the  context. 

In  this  and  the  following  two  chapters,  we  consider  situations  in  which  a  greater 
degree  of  flexibility  is  desired,  at  least  when  modeling  some  components  of  the 
covariate  vector  x.  Nonparametric  modeling  is  particularly  useful  when  one  has 
little  previous  experience  with  the  specific  data-generating  context.  Typically,  one 
may desire f(·) to arise from a class of functions with restrictions on smoothness
and  continuity.  Although  the  models  of  this  and  the  next  two  chapters  are  referred 
to  as  nonparametric,1  they  often  assume  parametric  forms  but  depend  on  a  large 
number  of  parameters  which  are  constrained  in  some  way,  in  order  to  prevent 
overfitting  of  the  data.  For  some  approaches,  for  example,  the  regression  tree  models 
described  in  Sect.  12.7,  the  model  is  specified  implicitly  through  an  algorithm,  with 
the  specific  form  (including  the  number  of  parameters)  being  selected  adaptively. 

There  are  a  number  of  contexts  in  which  flexible  modeling  is  required.  The 
simplest  is  when  a  graphical  description  of  a  set  of  data  is  needed,  which  is  often 
referred  to  as  scatterplot  smoothing.  Formal  inference  is  also  possible  within  a 
nonparametric framework, however. In some circumstances, estimation of a
parametric relationship between a response and an x variable may be of interest, while
requiring flexible nonparametric modeling of other nuisance variables (including
confounders). The example described in Sect. 1.3.6 is of this form, with the
association between spinal bone mineral density and ethnicity being of primary
interest, but with a flexible model for age being desired. Finally, an important and
common use of nonparametric modeling is prediction. In this case, the focus is on
the accuracy of the final prediction, with little interest in the values of the parameters
in the model. Prediction with a discrete outcome is often referred to as classification.

1  Some authors prefer the label semiparametric.

Much of the development of nonparametric methods, in particular those
associated with classification, has occurred in computer science and, more specifically,
machine learning, with a terminology that is quite different to that encountered in
the statistics literature. The data with which the model is fitted constitute the training
sample; nonparametric regression is referred to as learning a function; the covariates
are  called  features;  and  adding  a  penalty  term  to  an  objective  function  (e.g.,  a 
residual  sum  of  squares)  is  called  regularization.  In  supervised  learning  problems, 
there  is  an  outcome  variable  that  we  typically  wish  to  predict,  while  in  unsupervised 
learning  there  is  no  single  outcome  to  predict,  rather  the  aim  is  to  explore  how  the 
data  are  organized  or  clustered.  Only  supervised  learning  is  considered  here. 

The  layout  of  this  chapter  is  as  follows.  In  Sect.  10.2,  we  discuss  a  number 
of  motivating  examples.  Section  10.3  examines  what  response  summary  should 
be  reported  in  a  prediction  setting  using  a  decision  theory  framework,  while  in 
Sect.  10.4  various  measures  of  predictive  accuracy  are  reviewed.  A  recurring 
theme  will  be  the  bias-variance  trade-off  encountered  when  fitting  flexible  models 
containing  a  large  number  of  parameters.  To  avoid  excess  variance  of  the  prediction, 
various  techniques  that  reduce  model  complexity  will  be  described;  a  popular 
approach  is  to  penalize  large  values  of  the  parameters.  This  concept  is  illustrated 
in  Sect.  10.5  with  descriptions  of  ridge  regression  and  the  lasso.  These  shrinkage 
methods  are  introduced  in  the  context  of  multiple  linear  regression.2  Controlling 
the  complexity  of  a  model  is  a  key  element  of  nonparametric  regression  and  is 
usually  carried  out  using  smoothing  (or  tuning)  parameters.  In  Sect.  10.6,  smoothing 
parameter  estimation  is  considered.  Concluding  comments  appear  in  Sect.  10.7. 
There  is  a  huge  and  rapidly  growing  literature  on  nonparametric  modeling,  and  the 
surface  is  only  scratched  here;  Sect.  10.8  gives  references  to  broader  treatments  and 
to  more  detailed  accounts  of  specific  techniques. 

The  next  two  chapters  also  consider  nonparametric  modeling.  In  Chap.  11,  two 
popular approaches to smoothing are described: those based on splines and those
based on kernels; the focus of the latter is local regression. Chapter 11 only considers
situations  with  a  single  covariate,  with  multiple  predictors  considered  in  Chap.  12, 
along  with  methods  for  classification. 


10.2  Motivating  Examples 

Three examples that have been previously introduced will be used for illustrating
nonparametric modeling: the prostate cancer data described in Sect. 1.3.1 are used
for illustration in this chapter and in Chap. 12; the spinal bone mineral density data of
Sect. 1.3.6 will be analyzed in Chap. 11; and the bronchopulmonary dysplasia data
described in Sect. 7.2.3 will be examined in Chaps. 11 and 12. In this section, two
additional datasets are described.

2  Ridge regression is also briefly encountered in Sect. 5.12.

Fig. 10.1  Log ratio of two laser sources, as a function of the range (m), in the LIDAR data


10.2.1  Light  Detection  and  Ranging 

Figure 10.1 shows data, taken from Holst et al. (1996), from a light detection and
ranging  (LIDAR)  experiment.  The  LIDAR  technique  (which  is  similar  to  radar 
technology)  uses  the  reflection  of  laser-emitted  light  to  monitor  the  distribution  of 
atmospheric  pollutants.  The  data  we  consider  concern  mercury.  The  x-axis  measures 
distance  traveled  before  light  is  reflected  back  to  its  source  (and  is  referred  to  as  the 
range),  and  the  y-axis  is  the  logarithm  of  the  ratio  of  distance  measured  for  two  laser 
sources:  One  source  has  a  frequency  equal  to  the  resonant  frequency  of  mercury, 
and  the  other  has  a  frequency  off  this  resonant  frequency.  For  these  data,  point  and 
interval  estimates  for  the  association  between  the  log  ratio  and  range  are  of  interest. 
Figure  10.1  shows  a  clear  nonlinear  relationship  between  the  log  ratio  and  range, 
with  greater  variability  at  larger  ranges. 


10.2.2  Ethanol  Data 

This example concerns data collected in a study reported by Brinkman (1981). The
data consist of n = 88 measurements on three variables: NOx, the concentration
of nitric oxide (NO) and nitrogen dioxide (NO2) in the engine exhaust, with
normalization by the work done by the engine; C, the compression ratio of the
engine; and E, the equivalence ratio at which the engine was run, a measure of
the air/ethanol mix. Figure 10.2 gives a three-dimensional display of these data. The
aim is to build a predictive model, and a simple linear model is clearly inadequate
since there is a strong nonlinear (inverse U-shaped) association between NOx and E.
The form of the association between NOx and C is less clear.

Fig. 10.2  A three-dimensional display of the ethanol data, showing the normalized
concentration of nitric oxide and nitrogen dioxide (NOx) as a function of the equivalence
ratio at which the engine was run (E) and the compression ratio of the engine (C)


10.3  The  Optimal  Prediction 

Before  considering  model  specification  and  describing  methods  for  fitting,  we  use  a 
decision theory framework to decide on which summary of the distribution of Y | x
we should report if the aim of analysis is prediction, where x is a 1 x (k + 1) design
vector  corresponding  to  the  intercept  and  k  covariates.  Throughout  this  section, 
we  will  suppose  we  are  in  an  idealized  situation  in  which  all  aspects  of  the  data- 
generating  mechanism  are  known,  and  we  need  only  decide  on  which  quantity  to 
report. 

The  specific  decision  problem  we  consider  is  the  following.  Imagine  we  are 
involved  in  a  game  in  which  the  aim  is  to  predict  a  new  observation  y,  using 
a  function  of  covariates  x,  f(x).  Further,  we  know  that  our  predictions  will  be 
penalized via a loss function L[y, f(x)] that is the penalty incurred when predicting
y by f(x). The optimal prediction is that which minimizes the expected loss
defined as

$$E_{X,Y}\{L[Y, f(X)]\},$$ (10.1)

where  the  expectation  is  with  respect  to  the  joint  distribution  of  the  random  variables 
Y  and  X. 
10.3.1  Continuous Responses


The  most  common  choice  of  loss  function  is  squared  error  loss,  with  f(x)  chosen 
to  minimize  the  expected  (squared)  prediction  error: 

$$E_{X,Y}\{[Y - f(X)]^2\},$$ (10.2)

that is, the quadratic loss. Writing (10.2) as

$$E_X\left[E_{Y \mid X}\{[Y - f(x)]^2 \mid X = x\}\right]$$

indicates that we may minimize pointwise, with solution

$$f(x) = E[Y \mid x],$$


that  is,  the  conditional  expectation  (Exercise  10.4).  Hence,  the  best  prediction, 
f(x),  is  the  usual  regression  function. 
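
The pointwise minimization can be spelled out in one step. The following is a sketch of the standard decomposition, writing $m(x) = E[Y \mid x]$ (a symbol introduced here for brevity, not notation used elsewhere in the text):

```latex
\begin{align*}
E\{[Y - f(x)]^2 \mid X = x\}
  &= E\{[Y - m(x)]^2 \mid X = x\}
     + 2\,[m(x) - f(x)]\,E\{Y - m(x) \mid X = x\}
     + [m(x) - f(x)]^2 \\
  &= \mathrm{var}(Y \mid x) + [m(x) - f(x)]^2 .
\end{align*}
```

The cross term vanishes because $E\{Y - m(x) \mid X = x\} = 0$, and the variance term does not involve $f$, so the expression is smallest when $f(x) = m(x)$.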

As an alternative, with absolute loss, $E_{X,Y}[\,|Y - f(X)|\,]$, the solution is the
conditional median

$$f(x) = \mathrm{median}(Y \mid x)$$

(Exercise 10.4). Modeling via the median, rather than the mean, provides greater
robustness to outliers but with an increase in computational complexity. Exercise 10.4
also considers a generalization of absolute loss.
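
The contrast between the two loss functions, and the robustness of the median, can be seen on a small sample containing one outlier. The sketch below replaces the population expectation by a sample average and searches a grid for the minimizer; the sample values, the grid and the function names are illustrative assumptions.

```python
import statistics


def squared_loss(y, f):
    """Squared error loss (y - f)^2."""
    return (y - f) ** 2


def absolute_loss(y, f):
    """Absolute loss |y - f|."""
    return abs(y - f)


def grid_minimizer(sample, loss, lo, hi, steps=2000):
    """Return the grid value f in [lo, hi] with the smallest average loss."""
    best_f, best_val = lo, float("inf")
    for i in range(steps + 1):
        f = lo + (hi - lo) * i / steps
        val = sum(loss(y, f) for y in sample) / len(sample)
        if val < best_val:
            best_f, best_val = f, val
    return best_f


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
    print("squared loss minimizer:", grid_minimizer(sample, squared_loss, 0.0, 100.0))
    print("sample mean:", statistics.mean(sample))
    print("absolute loss minimizer:", grid_minimizer(sample, absolute_loss, 0.0, 100.0))
    print("sample median:", statistics.median(sample))
```

The squared-loss minimizer is pulled toward the outlier along with the mean, while the absolute-loss minimizer stays at the median.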

Other  choices  have  also  been  suggested  for  specific  situations.  For  example,  the 
scaled  quadratic  loss  function 


$$L[y, f(x)] = \left[\frac{y - f(x)}{y}\right]^2$$ (10.3)


has  been  advocated  for  random  variables  y  >  0  (e.g.,  Bernardo  and  Smith  1994, 
p. 301). This loss function scales departures $y - f(x)$ by $y$, so that discrepancies
in the predictions of the same magnitude are penalized more heavily for small $y$
than for large $y$. Taking the expectation of (10.3) with respect to $Y \mid x$ and
minimizing over $f(x)$ leads to

$$f(x) = \frac{E[Y^{-1} \mid x]}{E[Y^{-2} \mid x]}.$$ (10.4)


For  details,  see  Exercise  10.5.  As  an  example,  suppose  the  data  are  gamma 
distributed  as 


$$Y \mid \mu(x), \alpha \sim_{iid} \mathrm{Ga}\{\alpha^{-1}, [\mu(x)\alpha]^{-1}\},$$

where $E[Y \mid x] = \mu(x)$, and $\alpha^{1/2}$ is the coefficient of variation. Then
Exercise 10.5 shows that (10.4) is equal to

$$f(x) = (1 - 2\alpha)\mu(x),$$ (10.5)

for $\alpha < 0.5$. Hence, under the scaled quadratic loss function, we should scale the
mean function by $1 - 2\alpha$ when reporting.
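
The shrinkage factor in (10.5) can be verified numerically from the moments of a gamma distribution. The sketch below assumes the shape and rate parameterization, with shape 1/alpha and rate 1/(mu alpha), under which E[Y^k] = Gamma(shape + k) / (Gamma(shape) rate^k) whenever shape + k > 0. The function names are illustrative.

```python
import math


def gamma_moment(k, shape, rate):
    """E[Y^k] for Y ~ Ga(shape, rate); requires shape + k > 0."""
    return math.exp(math.lgamma(shape + k) - math.lgamma(shape)) / rate ** k


def optimal_scaled_quadratic(mu, alpha):
    """Ratio E[Y^-1] / E[Y^-2] from (10.4) for the gamma model; needs alpha < 0.5."""
    shape = 1.0 / alpha
    rate = 1.0 / (mu * alpha)
    return gamma_moment(-1, shape, rate) / gamma_moment(-2, shape, rate)


if __name__ == "__main__":
    mu, alpha = 3.0, 0.2
    print(optimal_scaled_quadratic(mu, alpha), (1.0 - 2.0 * alpha) * mu)
```

The two printed values should agree up to floating-point error, which is a check of (10.5) rather than a proof of it.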


10.3.2  Discrete  Responses  with  K  Categories 


Now suppose the response is categorical, with $Y \in \{0, 1, \dots, K - 1\}$. Again, we
must decide on which summary measure to report. One approach is to assign a class
in $\{0, 1, \dots, K - 1\}$ to a new case via a classification rule g(x). Alternatively, a
probability distribution over the classes may be reported.3

Suppose the distributions of x given Y = k, p(x | Y = k), are known along
with prior probabilities on the classes, $\Pr(Y = k) = \pi_k$. Then, via Bayes theorem,
the posterior class probabilities may be obtained:

$$\Pr(Y = k \mid x) = \frac{p(x \mid Y = k)\,\pi_k}{\sum_{l=0}^{K-1} p(x \mid Y = l)\,\pi_l}.$$ (10.6)


Choosing  the  k  that  maximizes  these  probabilities  gives  a  Bayes  classifier. 
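
The calculation in (10.6) and the resulting Bayes classifier can be sketched in a few lines. The example assumes, purely for illustration, univariate normal class-conditional densities; the function names and the particular means, standard deviations and priors are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probs(x, densities, priors):
    """Pr(Y = k | x) via Bayes theorem: p(x | Y = k) pi_k, normalized over k."""
    joint = [dens(x) * pi for dens, pi in zip(densities, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def bayes_classifier(x, densities, priors):
    """Class label k with the largest posterior probability."""
    post = posterior_probs(x, densities, priors)
    return max(range(len(post)), key=post.__getitem__)


if __name__ == "__main__":
    # Two hypothetical classes: x | Y = 0 ~ N(0, 1) and x | Y = 1 ~ N(2, 1).
    densities = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 2.0, 1.0)]
    for x in (-1.0, 1.0, 3.0):
        print(x, posterior_probs(x, densities, [0.5, 0.5]), bayes_classifier(x, densities, [0.5, 0.5]))
```

Changing the priors shifts the point at which the assigned class changes, even though the class-conditional densities are unchanged.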

For  the  situation  in  which  we  wish  to  assign  a  class  label,  the  loss  function  is 
a  K  x  K  matrix  L  with  element  L(j,  k)  representing  the  loss  incurred  when  the 
truth is Y = j, and the classification is g(x) = k, with $j, k \in \{0, 1, \dots, K - 1\}$.
A sensible loss function is

$$L(j, k) = \begin{cases} 0 & \text{if } j = k \\ \ge 0 & \text{if } j \ne k. \end{cases}$$ (10.7)


In most cases, we will assign $L(j, k) > 0$ for $j \ne k$, but in some contexts incorrect
classifications will not be penalized if they are of no consequence. We emphasize
that the class predictor g(x) takes a value from the set $\{0, 1, \dots, K - 1\}$ and is a
function of $\Pr(Y = k \mid x)$. The expected loss is

$$E_{X,Y}\{L[Y, g(X)]\} = E_X\left[E_{Y \mid x}\{L[Y, g(x)] \mid X = x\}\right]
= E_X\left[\sum_{k=0}^{K-1} L[Y = k, g(x)]\,\Pr(Y = k \mid x)\right]$$ (10.8)

where we are assuming the form of g(x) is known. The inner expectation of (10.8)
is known as the Bayes risk (e.g., Ripley 1996), with minimum

$$\widehat{g}(x) = \arg\min_{g(x) \in \{0, \dots, K-1\}} \sum_{k=0}^{K-1} L[Y = k, g(x)]\,\Pr(Y = k \mid x).$$

3  It is possible to also have a “doubt” category that is assigned if there is sufficient ambiguity but
we do not consider this possibility. See Ripley (1996) for further discussion.

Table 10.1  Loss table for a binary decision problem

                          Predicted class
                          g(x) = 0     g(x) = 1
True class    Y = 0       0            L(0,1)
              Y = 1       L(1,0)       0


The K = 2 situation will now be considered in greater detail. Table 10.1 gives
the table of losses for this case. The Bayes risk is minimized by the choice

$$\widehat{g}(x) = \arg\min_{g(x) \in \{0, 1\}} \left\{L[Y = 0, g(x)]\,\Pr(Y = 0 \mid x) + L[Y = 1, g(x)]\,\Pr(Y = 1 \mid x)\right\}.$$

Hence,

g(x) = 0 gives Bayes risk = $L(1, 0) \times \Pr(Y = 1 \mid x)$
g(x) = 1 gives Bayes risk = $L(0, 1) \times [1 - \Pr(Y = 1 \mid x)]$

and so the Bayes risk is minimized by g(x) = 1 if

$$L(1, 0) \times \Pr(Y = 1 \mid x) > L(0, 1) \times [1 - \Pr(Y = 1 \mid x)]$$

or equivalently if

$$\frac{\Pr(Y = 1 \mid x)}{1 - \Pr(Y = 1 \mid x)} > \frac{L(0, 1)}{L(1, 0)} = R,$$ (10.9)

with the consequence that only the ratio of losses R requires specification. A final
restatement is to classify a new case with covariates x as g(x) = 1 if

$$\Pr(Y = 1 \mid x) > \frac{L(0, 1)}{L(0, 1) + L(1, 0)} = \frac{R}{1 + R}.$$

If classifying as g(x) = 1 when Y = 0 is much worse than classifying as g(x) = 0
when Y = 1, then R should be given a value greater than 1, which raises the threshold
R/(1 + R) above 0.5. With equal losses, R = 1, and we assign g(x) = 1 if
Pr(Y = 1 | x) > 0.5. For example, if R = 4, we set g(x) = 1
only if Pr(Y = 1 | x) > 0.8.

Returning to the case of general K, in the most straightforward case of all errors
being equal, we simply assign an observation to the most likely class, using the
probabilities $\Pr(Y = k \mid x)$, $k = 0, 1, \dots, K - 1$.
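
The decision rule for a general loss matrix, and the binary threshold R/(1 + R), can be sketched as follows. The loss matrix is indexed as loss[j][k] for truth j and classification k, matching L(j, k) in the text; the function names are illustrative.

```python
def bayes_rule(post, loss):
    """Return (best class, risks), where risks[c] = sum_k loss[k][c] * post[k]."""
    n_classes = len(post)
    risks = [sum(loss[k][c] * post[k] for k in range(n_classes)) for c in range(n_classes)]
    best = min(range(n_classes), key=risks.__getitem__)
    return best, risks


def binary_threshold(loss01, loss10):
    """Threshold R / (1 + R) on Pr(Y = 1 | x), with R = L(0,1) / L(1,0)."""
    ratio = loss01 / loss10
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    loss = [[0.0, 4.0], [1.0, 0.0]]  # L(0,1) = 4 and L(1,0) = 1, so R = 4
    print(binary_threshold(4.0, 1.0))
    for p1 in (0.6, 0.75, 0.9):
        print(p1, bayes_rule([1.0 - p1, p1], loss))
```

With a 0/1 loss matrix of any size, the same function reduces to picking the class with the largest posterior probability.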
We now turn to the second situation in which a classification is not required, but
rather a set of probabilities over $\{0, 1, \dots, K - 1\}$, that is, we require
$f(x) = [f_0(x), \dots, f_{K-1}(x)]$. First, consider the K = 2 (binary) case. In this case we
simplify notation and write f(x) = [f(x), 1 - f(x)]. We may specify a loss
function which is proportional to the negative Bernoulli log-likelihood

$$L[y, f(x)] = -2y \log[f(x)] - 2(1 - y)\log[1 - f(x)]$$ (10.10)

where f(x) is the function that we will report. Therefore, if the log-likelihood is
high the loss is low. The expectation of (10.10) is

$$-2\Pr(Y = 1 \mid x)\log[f(x)] - 2[1 - \Pr(Y = 1 \mid x)]\log[1 - f(x)]$$

where $E[Y \mid x] = \Pr(Y = 1 \mid x)$ are the true probabilities, given covariates x. The
solution is $f(x) = \Pr(Y = 1 \mid x)$. Hence, to minimize the expected deviance-type
loss function, the true probabilities should be reported, which is not a great surprise.
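
The solution follows from setting the derivative of the expected loss to zero. As a brief sketch, write $p = \Pr(Y = 1 \mid x)$ and $f = f(x)$ (shorthand introduced here):

```latex
\begin{align*}
\frac{\partial}{\partial f}\left\{-2p\log f - 2(1 - p)\log(1 - f)\right\}
  = -\frac{2p}{f} + \frac{2(1 - p)}{1 - f} = 0
  \quad\Longrightarrow\quad p(1 - f) = (1 - p)f
  \quad\Longrightarrow\quad f = p .
\end{align*}
```

The second derivative, $2p/f^2 + 2(1 - p)/(1 - f)^2$, is positive for $0 < f < 1$, so this stationary point is a minimum.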

In the general case of K classes and a multinomial likelihood with one trial and
probabilities $f(x) = [f_0(x), \dots, f_{K-1}(x)]$, the deviance loss function is

$$L[y, f(x)] = -2\sum_{k=0}^{K-1} I(y = k)\log f_k(x),$$ (10.11)

where $I(\cdot)$ is the indicator function that equals 1 if its argument is true and 0
otherwise. The expected loss is

$$-2\sum_{k=0}^{K-1} \Pr(Y = k \mid x)\log f_k(x)$$

which is minimized by $f_k(x) = \Pr(Y = k \mid x)$.
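
That the true probabilities minimize the expected multinomial deviance loss can be checked numerically by comparing the truth with a few competing probability vectors. The sketch below is illustrative only; the probability vectors are arbitrary choices.

```python
import math


def expected_deviance_loss(true_probs, reported_probs):
    """-2 * sum_k Pr(Y = k | x) * log f_k(x), skipping classes with zero true probability."""
    return -2.0 * sum(
        p * math.log(f) for p, f in zip(true_probs, reported_probs) if p > 0.0
    )


if __name__ == "__main__":
    truth = [0.2, 0.5, 0.3]
    print("reporting the truth:", expected_deviance_loss(truth, truth))
    for reported in ([0.3, 0.4, 0.3], [1 / 3, 1 / 3, 1 / 3], [0.1, 0.6, 0.3]):
        print(reported, expected_deviance_loss(truth, reported))
```

The excess of each competing value over the value at the truth is twice the Kullback-Leibler divergence from the truth to the reported vector, which is never negative.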

10.3.3  General  Responses 

In  general,  if  we  are  willing  to  speculate  on  a  distribution  for  the  data,  we  may  take 
the  loss  function  as 

$$L[y, f(x)] = -2\log p_f(y \mid x),$$ (10.12)

which is the deviance (Sect. 6.5.3), up to an additive constant not depending on f.
The notation $p_f$ emphasizes that the distribution of the data depends on f. The
previous section gave examples of this loss function for binomial, (10.10), and
multinomial, (10.11), data. The loss function (10.12) is a natural general measure of
the discrepancy between the data y and the predictor function f(x).
When $Y \mid x \sim N[f(x), \sigma^2]$, we obtain

$$L[y, f(x)] = \log(2\pi\sigma^2) + [y - f(x)]^2/\sigma^2,$$

which produces $f(x) = E[Y \mid x]$, as with quadratic loss. Similarly, choosing a
Laplacian distribution, that is, $Y \mid x \sim \mathrm{Lap}[f(x), \phi]$ (Appendix D) leads to the
conditional median as the optimal choice.
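
The Laplacian case can be written out in the same way as the normal case. The following sketch assumes the common location-scale form of the Laplace density, $p_f(y \mid x) = (2\phi)^{-1}\exp\{-|y - f(x)|/\phi\}$; the parameterization in Appendix D may differ by a rescaling of $\phi$, which would not change the conclusion.

```latex
\begin{align*}
L[y, f(x)] = -2\log p_f(y \mid x)
           = 2\log(2\phi) + \frac{2\,|y - f(x)|}{\phi} .
\end{align*}
```

Up to constants that do not involve $f$, this is absolute loss, which is why the median of $Y \mid x$ is the optimal report.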


10.3.4  In  Practice 

Sections  10.3.1  and  10.3.2  describe  which  summary  should  be  reported,  if  one  is 
willing  to  specify  a  loss  function.  Such  a  loss  function  will  often  have  been  based 
on  an  implicit  model  for  the  distribution  of  the  data  or  upon  an  estimation  method. 

For  example,  a  quadratic  loss  function  is  consistent  with  a  model  for  continuous 
responses  with  additive  errors  which  is  of  the  form 

$$Y = f(x) + \epsilon$$ (10.13)

with $E[\epsilon] = 0$, $\mathrm{var}(\epsilon) = \sigma^2$ and errors on different responses being uncorrelated.
This form may be supplemented with the assumption of normal errors or one may
simply proceed with least squares estimation. Modeling proceeds by assuming some
particular form for f(x). A simple approach is to assume that the conditional mean,
f(x), is approximated by the linear model $x\beta$, as in Chap. 5. Alternative nonlinear
models  are  described  in  Chap.  6. 

Relaxing  the  constant  variance  assumption,  one  may  consider  generalized  linear 
model  (GLM)  type  situations,  to  allow  for  more  flexible  mean-variance  modeling. 
GLMs  are  also  described  in  Chap.  6.  An  assumption  of  a  particular  distributional 
form may be combined with the deviance-type loss function (10.12).

In  Sect.  10.3.2  discrete  responses  were  considered,  and  we  saw  that  with  equal 
losses,  one  may  classify  on  the  basis  of  the  probabilities  Pr(Y  =  k  \  x).  As 
described  in  Chap.  12,  there  are  two  broad  approaches  to  classification.  The  first 
approach directly models the probabilities Pr(Y = k | x). For example, in the case
of binary (K = 2) responses, logistic modeling provides an obvious approach (as
described in Sect. 7.6). Chapter 12 describes a number of additional methods to model
the probabilities as a function of x. The second approach is to assume forms for the
distributions of x given Y = k, p(x | Y = k), and then combine these with prior
probabilities on the classes, $\Pr(Y = k) = \pi_k$, to form posterior classifications,
via (10.6); Chapter 12 also considers this situation.


10.4  Measures  of  Predictive  Accuracy 

As  already  noted,  nonparametric  modeling  is  often  used  for  prediction,  and  so 
the  conventional  criteria  by  which  methods  of  parameter  estimation  are  compared 
(as  discussed  in  Sect.  2.2)  are  not  directly  relevant.  In  a  prediction  context,  there  is 
less  concern  about  the  values  of  the  constituent  parts  of  the  prediction  equation, 
rather  interest  is  on  the  total  contribution.  In  Sect.  10.3,  loss  functions  were 
introduced in order to determine how to report the prediction. In this section, loss
functions  are  used  to  provide  an  overall  measure  of  the  “error”  of  a  procedure. 

The  generalization  error  is  defined  as 


$$\mathrm{GE}(\widehat{f}) = E_{X,Y}\{L[Y, \widehat{f}(X)]\}$$ (10.14)

where $\widehat{f}(X)$ is the prediction for Y at a point X, with X, Y drawn from their joint
distribution. Hence, we are in the so-called X-random, as opposed to X-fixed, case
(Breiman  and  Spector  1992).  The  terminology  with  respect  to  different  measures  of 
accuracy  can  be  confusing  and  is  also  inconsistent  in  the  literature;  the  notation  used 
here  is  summarized  in  Table  10.2. 

Hastie  et  al.  (2009,  Sect.  7.2)  describe  how  one  would  ideally  split  the  data 
into  three  portions  with  one  part  being  used  to  fit  (or  train)  models,  a  second 
(validation)  part  to  choose  a  model  (which  includes  both  choosing  between  different 
classes  of  models  and  selecting  smoothing  parameters  within  model  classes),  and 
a  third  part  to  estimate  the  generalization  error  of  the  final  model  on  a  test 
dataset.  Unfortunately,  there  are  often  insufficient  data  for  division  into  three  parts. 
Consequently,  when  prediction  methods  are  to  be  compared,  a  common  approach  is 
to  separate  the  data  into  training  and  test  datasets.  The  training  data  are  used  to  train 
the  model  and  then  approximate  the  validation  step  using  methods  to  be  described  in 
Sect.  10.6.  The  test  data  are  used  to  estimate  the  generalization  error  (10.14)  using 
the  function  f(x)  estimated  from  the  training  data.  We  now  discuss  the  form  of  the 
generalization  error  for  different  data  types. 


10.4.1  Continuous  Responses 


To  gain  flexibility  and  so  minimize  bias,  predictive  models  f(x)  that  contain  many 
parameters  are  appealing.  However,  if  the  parameters  are  not  constrained  in  some 
way,  such  models  produce  wide  predictive  intervals  because  a  set  of  data  only 
contains  a  limited  amount  of  information.  In  general,  as  the  number  of  parameters 
increases,  the  uncertainty  in  the  estimation  of  each  increases  in  tandem,  which 
results  in  greater  uncertainty  in  the  prediction  also.  Consequently,  throughout  this 
and  the  next  two  chapters,  we  will  repeatedly  encounter  the  bias-variance  trade-off. 
Section  5.9  provides  a  discussion  of  this  trade-off  in  the  linear  model  context. 

The  expected  squared  prediction  error  is  a  special  case  of  the  generalization  error 
with  squared  error  loss: 


$$
\mathrm{ESPE}(\hat{f}) = E_{X,Y}\left\{ \left[ Y - \hat{f}(X) \right]^2 \right\}, \tag{10.15}
$$


where $\hat{f}(X)$ is again the prediction for $Y$ at a point $X$, with $X$, $Y$ drawn from their
joint distribution.

Table 10.2 Summary of predictive accuracy measures

| Name | Short-hand | Definition |
|---|---|---|
| Generalization error | $\mathrm{GE}(\hat{f})$ | $E_{X,Y}\{L[Y, \hat{f}(X)]\}$ |
| Expected squared prediction error | $\mathrm{ESPE}(\hat{f})$ | $E_{X,Y}\{[Y - \hat{f}(X)]^2\}$ |
| Mean squared error (or risk) | $\mathrm{MSE}[\hat{f}(x_0)]$ | $E_{Y_n}\{[\hat{f}(x_0) - f(x_0)]^2\}$ |
| Predictive risk | $\mathrm{PR}[\hat{f}(x_0)]$ | $E_{Y_n,Y_0}\{[Y_0 - \hat{f}(x_0)]^2\} = \sigma^2 + \mathrm{MSE}[\hat{f}(x_0)]$ |
| Integrated mean squared error | $\mathrm{IMSE}(\hat{f})$ | $\int E_{Y_n}\{[\hat{f}(x) - f(x)]^2\}\, p(x)\, dx = \int \mathrm{MSE}[\hat{f}(x)]\, p(x)\, dx$ |
| Average mean squared error | $\mathrm{AMSE}(\hat{f})$ | $n^{-1}\sum_{i=1}^{n} E_{Y_n}\{[\hat{f}(x_i) - f(x_i)]^2\} = n^{-1}\sum_{i=1}^{n} \mathrm{MSE}[\hat{f}(x_i)]$ |
| Average predictive risk | $\mathrm{APR}(\hat{f})$ | $n^{-1}\sum_{i=1}^{n} E_{Y_n,Y_i^*}\{[Y_i^* - \hat{f}(x_i)]^2\} = \sigma^2 + \mathrm{AMSE}(\hat{f})$ |
| Residual sum of squares | $\mathrm{RSS}(\hat{f})$ | $n^{-1}\sum_{i=1}^{n} [y_i - \hat{f}(x_i)]^2$ |
| Leave-one-out (ordinary) CV score | $\mathrm{OCV}(\hat{f})$ | $n^{-1}\sum_{i=1}^{n} [y_i - \hat{f}_{-i}(x_i)]^2$ |
| Generalized CV score | $\mathrm{GCV}(\hat{f})$ | $[n - \mathrm{tr}(S)]^{-1}\sum_{i=1}^{n} [y_i - \hat{f}(x_i)]^2$ |

All rows of the table but the first are based on integrated or summed squared quantities and, hence,
are appropriate for a model of the form $y = f(x) + \epsilon$ with the error terms $\epsilon$ having zero mean,
constant variance $\sigma^2$, and with error terms at different $x$ being uncorrelated. CV is short for
cross-validation, with OCV and GCV being described in Sects. 10.6.2 and 10.6.3, respectively. Notation:
The predictive model evaluated at covariates $x$ is $f(x)$, with prediction $\hat{f}(x)$ based on the observed
data $Y_n = [Y_1, \ldots, Y_n]$; $Y_0$ is a new response with associated covariates $x_0$; the observed data
are $[y_i, x_i]$, $i = 1, \ldots, n$; $Y^* = [Y_1^*, \ldots, Y_n^*]$ are a set of new observations with covariates
$x_1, \ldots, x_n$ that we would like to predict; $p(x)$ is the distribution of the covariates; $\hat{f}_{-i}(x_i)$ is the
prediction at the point $x_i$ based on the observed data with the $i$-th case, $[y_i, x_i]$, removed; $S$ is the
“smoother” hat matrix and is described in Sect. 10.6.1. The entries in the last three lines are all
estimates of $\mathrm{ESPE}(\hat{f})$.

Estimators $\hat{f}$ with small $\mathrm{ESPE}(\hat{f})$ are sought, but balancing the bias in estimation
with the variance will be a constant challenge. To illustrate, suppose we wish to
predict a response $Y_0$ with associated covariates $x_0$. We calculate the expected
squared distance between the response $Y_0$ and the fitted function $\hat{f}(x_0)$. The
expectation is with respect to both $Y_0$ and repeat (training) data $Y_n = [Y_1, \ldots, Y_n]$,
with $Y_0$ and $Y_n$ being independent. The resultant measure is known as the predictive
risk and may be decomposed as


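$$
\begin{aligned}
E_{Y_n,Y_0}\left\{ [Y_0 - \hat{f}(x_0)]^2 \right\}
&= E_{Y_n,Y_0}\left\{ [Y_0 - f(x_0) + f(x_0) - \hat{f}(x_0)]^2 \right\} \\
&= E_{Y_0}\left\{ [Y_0 - f(x_0)]^2 \right\} + E_{Y_n}\left\{ [f(x_0) - \hat{f}(x_0)]^2 \right\} \\
&\quad + 2 \times E_{Y_0}\left\{ [Y_0 - f(x_0)] \right\} E_{Y_n}\left\{ [f(x_0) - \hat{f}(x_0)] \right\} \\
&= \sigma^2 + E_{Y_n}\left\{ [f(x_0) - \hat{f}(x_0)]^2 \right\} \\
&= \sigma^2 + \mathrm{MSE}\left[ \hat{f}(x_0) \right].
\end{aligned}
$$

Writing

$$
\mathrm{MSE}\left[ \hat{f}(x_0) \right] = E_{Y_n}\left\{ \left[ \hat{f}(x_0) - E_{Y_n}(\hat{f}(x_0)) + E_{Y_n}(\hat{f}(x_0)) - f(x_0) \right]^2 \right\}
$$

we have

$$
\begin{aligned}
E_{Y_n,Y_0}\left\{ [Y_0 - \hat{f}(x_0)]^2 \right\}
&= \sigma^2 + E_{Y_n}\left\{ \left[ E_{Y_n}(\hat{f}(x_0)) - f(x_0) \right]^2 \right\}
 + E_{Y_n}\left\{ \left[ \hat{f}(x_0) - E_{Y_n}(\hat{f}(x_0)) \right]^2 \right\} \\
&= \sigma^2 + \mathrm{bias}\left[ \hat{f}(x_0) \right]^2 + \mathrm{var}\left[ \hat{f}(x_0) \right].
\end{aligned}
$$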

In terms of the prediction error we can achieve given a particular model, nothing
can be done about $\sigma^2$, which is referred to as the irreducible error. Therefore, we
concentrate on the MSE of the estimator $\hat{f}(x_0)$:

$$
\mathrm{MSE}\left[ \hat{f}(x_0) \right] = E_{Y_n}\left\{ [\hat{f}(x_0) - f(x_0)]^2 \right\}
= \mathrm{bias}\left[ \hat{f}(x_0) \right]^2 + \mathrm{var}\left[ \hat{f}(x_0) \right],
$$

where we emphasize that the MSE is calculated at the point $x_0$, with the expectation
over training samples. As we discuss subsequently, the estimators $\hat{f}$ we consider
are indexed by a smoothing parameter, and selection of this parameter influences
the characteristics of $\hat{f}$. Little smoothing produces a wiggly $\hat{f}$, with low bias and
high variance. More extensive smoothing produces $\hat{f}$ with greater bias but reduced
variance.
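The trade-off can be checked numerically. The following is a minimal Monte Carlo sketch and is not taken from the text: it assumes a hypothetical true function $f(x) = \sin(2\pi x)$, a fixed equally spaced design, normal errors, and it uses the degree of a polynomial least squares fit as a stand-in for the smoothing parameter. The function name `mc_bias_variance` and all numerical settings are illustrative assumptions.

```python
import numpy as np


def mc_bias_variance(degree, x0=0.3, n=30, sigma=0.3, n_rep=1000, seed=1):
    """Monte Carlo estimates of bias^2, variance and MSE of f_hat(x0).

    Assumptions (illustrative only): true f(x) = sin(2*pi*x), fixed design on
    [0, 1], N(0, sigma^2) errors, polynomial least squares fit of given degree.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    f_x = np.sin(2.0 * np.pi * x)
    f_x0 = np.sin(2.0 * np.pi * x0)
    preds = np.empty(n_rep)
    for r in range(n_rep):
        # A fresh training sample Y_n at the same design points.
        y = f_x + rng.normal(0.0, sigma, size=n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - f_x0) ** 2
    var = preds.var()  # divisor n_rep, so mse = bias2 + var holds exactly
    mse = np.mean((preds - f_x0) ** 2)
    return bias2, var, mse


rigid = mc_bias_variance(degree=1)     # heavy "smoothing": a straight line
flexible = mc_bias_variance(degree=5)  # light "smoothing": a quintic
for name, (b2, v, m) in [("degree 1", rigid), ("degree 5", flexible)]:
    print(f"{name}: bias^2={b2:.4f}  var={v:.4f}  mse={m:.4f}")
```

Under these assumptions the straight line should show the larger squared bias and the quintic the larger variance, in line with the qualitative description above; the actual numbers depend on the chosen $f$, $x_0$, $n$, and $\sigma$.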

To summarize the MSE over the range of $x$, we may consider the integrated
mean squared error (IMSE). For univariate $x$, over an interval $[a, b]$, and with
density $p(x)$:

$$
\begin{aligned}
\mathrm{IMSE}(\hat{f}) &= \int_a^b E_{Y_n}\left\{ [\hat{f}(x) - f(x)]^2 \right\} p(x)\, dx \\
&= \int_a^b \mathrm{bias}\left[ \hat{f}(x) \right]^2 p(x)\, dx + \int_a^b \mathrm{var}\left[ \hat{f}(x) \right] p(x)\, dx.
\end{aligned}
\tag{10.16}
$$

This summary will be encountered in Sect. 11.3.2.

An alternative to the IMSE, that may be more convenient to use, is the average
mean squared error (AMSE), which only considers the errors at the observations:

$$
\begin{aligned}
\mathrm{AMSE}(\hat{f}) &= \frac{1}{n}\sum_{i=1}^{n} E_{Y_n}\left\{ [\hat{f}(x_i) - f(x_i)]^2 \right\} \\
&= \frac{1}{n}\sum_{i=1}^{n} \mathrm{bias}\left[ \hat{f}(x_i) \right]^2 + \frac{1}{n}\sum_{i=1}^{n} \mathrm{var}\left[ \hat{f}(x_i) \right].
\end{aligned}
\tag{10.17}
$$


For the additive errors model (10.13), the average predictive risk (APR) is

$$
\mathrm{APR}(\hat{f}) = \frac{1}{n}\sum_{i=1}^{n} E_{Y_n,Y_i^*}\left\{ [Y_i^* - \hat{f}(x_i)]^2 \right\}
= \sigma^2 + \mathrm{AMSE}(\hat{f}),
$$

where $Y^* = [Y_1^*, \ldots, Y_n^*]$ are the new set of observations which we would like to
predict at $x_1, \ldots, x_n$, and are independent of $Y_n$. In Sect. 10.6.1, a procedure for
estimating the APR will be described in the context of smoothing parameter choice.

We denote the test data by $[y_i^*, x_i^*]$, $i = 1, \ldots, m$. For continuous data and
quadratic loss, we may evaluate an estimate of the expected squared prediction
error (10.15):

$$
\frac{1}{m}\sum_{i=1}^{m} \left[ y_i^* - \hat{f}(x_i^*) \right]^2, \tag{10.18}
$$

where $\hat{f}(x_i^*)$ is the estimator based on the training data.
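As a minimal sketch of (10.18), the estimate is just the mean squared difference between the test responses and the predictions from the model fitted on the training data. The function name `espe_estimate`, the simulated data, and the cubic polynomial used as the fitted model are all assumptions made for illustration.

```python
import numpy as np


def espe_estimate(y_test, y_pred):
    """Test-set estimate of the expected squared prediction error, as in (10.18)."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_test - y_pred) ** 2))


# Hypothetical illustration: simulate data, train on one part, test on the other.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=60)
y_train = np.sin(2.0 * np.pi * x_train) + rng.normal(0.0, 0.3, size=60)
x_test = rng.uniform(0.0, 1.0, size=40)
y_test = np.sin(2.0 * np.pi * x_test) + rng.normal(0.0, 0.3, size=40)

coef = np.polyfit(x_train, y_train, deg=3)   # f_hat estimated from the training data only
espe_hat = espe_estimate(y_test, np.polyval(coef, x_test))
print(f"estimated ESPE: {espe_hat:.3f}")
```

The test responses play no part in fitting `coef`, which is what allows the average to be read as an estimate of the generalization error under squared error loss.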


10.4.2  Discrete  Responses  with  K  Categories 

With the loss function (10.7), and with equal losses, the generalization error is

$$
\Pr_{X,Y}\left[ g(X) \neq Y \right], \tag{10.19}
$$

which is also known as the misclassification probability. Given test data $[y_i^*, x_i^*]$,
$i = 1, \ldots, m$, the empirical estimate is

$$
\frac{1}{m}\sum_{i=1}^{m} I\left[ g(x_i^*) \neq y_i^* \right],
$$

which is simply the proportion of misclassified observations.

We now consider the binary case and introduce terminology that is common in a
medical context, before describing additional measures that are useful summaries of
a procedure in this case. Suppose we wish to predict disease status given covariates
(symptoms) $x$. Define

$$
Y = \begin{cases} 0 & \text{if true state is no disease} \\ 1 & \text{if true state is disease.} \end{cases}
$$

A classification rule $g(x)$ is

$$
g(x) = \begin{cases} 0 & \text{if prediction is no disease} \\ 1 & \text{if prediction is disease.} \end{cases}
$$

The  sensitivity  of  a  rule  is  the  probability  of  predicting  disease  for  a  diseased 
individual: 

Sensitivity  =  Pr  [g(x)  =  1  |  Y  =  1] . 

The  specificity  is  the  probability  of  predicting  disease-free  for  an  individual  without 
disease: 

Specificity  =  Pr  [g(x)  =  0  |  Y  =  0] . 

With respect to Table 10.1, recall that $L(0, 1)$ is the loss for predicting $g(x) = 1$
when in reality $Y = 0$ (so we predict disease for a healthy individual) and $L(1, 0)$ is
the loss associated with predicting healthy for a diseased individual. Consequently,
if we increase the former loss $L(0, 1)$ while holding $L(1, 0)$ constant, we will
be more conservative in declaring a patient as diseased, which will increase the
specificity and decrease the sensitivity.⁴ An alternative, closely related, pair of
summaries are the false-positive fraction (FPF) and true-positive fraction (TPF)
defined, respectively, as

$$
\mathrm{FPF} = \Pr\left[ g(X) = 1 \mid Y = 0 \right]
$$

and

$$
\mathrm{TPF} = \Pr\left[ g(X) = 1 \mid Y = 1 \right].
$$


The sensitivity is the TPF, and the specificity is (1 − FPF). Two additional measures
are the positive predictive value (PPV) and the negative predictive value (NPV),
defined as

$$
\begin{aligned}
\mathrm{PPV} &= \Pr\left[ Y = 1 \mid g(X) = 1 \right] = \frac{\Pr\left[ g(X) = 1 \mid Y = 1 \right] \Pr(Y = 1)}{\Pr\left[ g(X) = 1 \right]}, \\
\mathrm{NPV} &= \Pr\left[ Y = 0 \mid g(X) = 0 \right] = \frac{\Pr\left[ g(X) = 0 \mid Y = 0 \right] \Pr(Y = 0)}{\Pr\left[ g(X) = 0 \right]},
\end{aligned}
$$

which give the probabilities of correct assignments, given classification.
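The summaries above have direct empirical counterparts computed from test-set labels and predictions. The sketch below is illustrative only: the function name `binary_summaries` and the small data vectors are assumptions, and the code assumes that both classes and both predictions occur at least once so that no denominator is zero.

```python
import numpy as np


def binary_summaries(y_true, y_pred):
    """Empirical sensitivity, specificity, FPF, PPV, NPV and misclassification rate.

    Coding follows the text: 1 = disease, 0 = no disease.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # diseased, predicted diseased
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # diseased, predicted healthy
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # healthy, predicted diseased
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # healthy, predicted healthy
    return {
        "sensitivity": tp / (tp + fn),   # TPF
        "specificity": tn / (tn + fp),   # 1 - FPF
        "fpf": fp / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "misclassification": (fp + fn) / y_true.size,
    }


# Hypothetical test data: three diseased and five healthy individuals.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
summary = binary_summaries(y_true, y_pred)
print(summary)
```

Note that the empirical PPV and NPV inherit the disease prevalence of the test sample, which mirrors the dependence on $\Pr(Y = 1)$ in the definitions.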


⁴ We note that the decision problem considered here has many elements in common with that in
which we choose between two hypotheses, as discussed in Sect. 4.3.1. The sensitivity is analogous
to the power of a test, while 1 − specificity is analogous to the type I error.

Now define a classification rule that, based on a model $g(x)$ (whose parameters
will be estimated from the data), assigns $g(x) = 1$ if the odds of disease

$$
\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)} > \frac{L(0, 1)}{L(1, 0)} = R,
$$

as discussed in more detail in relation to (10.9). Plotting TPF($R$) versus FPF($R$)
produces a receiver-operating characteristic (ROC) curve. The ROC curve gives
the complete behavior of FPF and TPF over the range of $R$. Pepe (2003) provides
an in-depth discussion of the above summary measures.
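To make the construction of the ROC curve concrete, the sketch below sweeps the odds threshold $R$ over a grid and records the empirical (FPF, TPF) pair at each value. It is an illustration under stated assumptions: the function name `roc_points`, the predicted disease probabilities, and the grid of thresholds are all hypothetical, and every predicted probability is assumed to be strictly less than 1 so that the odds are finite.

```python
import numpy as np


def roc_points(y_true, prob_disease, thresholds):
    """Empirical (R, FPF, TPF) triples for the rule: predict disease if odds > R."""
    y_true = np.asarray(y_true)
    prob = np.asarray(prob_disease, dtype=float)
    odds = prob / (1.0 - prob)  # assumes every probability is < 1
    points = []
    for r in thresholds:
        g = odds > r  # classification rule at this threshold
        tpf = float(np.mean(g[y_true == 1]))  # fraction of diseased predicted diseased
        fpf = float(np.mean(g[y_true == 0]))  # fraction of healthy predicted diseased
        points.append((r, fpf, tpf))
    return points


# Hypothetical test data and model-based probabilities Pr(Y = 1 | x).
y_true = [0, 0, 1, 1]
prob_disease = [0.10, 0.40, 0.35, 0.80]
curve = roc_points(y_true, prob_disease, thresholds=[0.05, 0.6, 1.0, 10.0])
for r, fpf, tpf in curve:
    print(f"R={r:5.2f}  FPF={fpf:.2f}  TPF={tpf:.2f}")
```

Small values of $R$ declare almost everyone diseased (FPF and TPF both near 1), while large values declare almost no one diseased (both near 0); the curve traces the path between these extremes.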


10.4.3  General  Responses 

For general data types we may evaluate the deviance-like loss function (10.12) over
the test data $[y_i^*, x_i^*]$, $i = 1, \ldots, m$:


to  measure  the  error  of  a  procedure. 

10.5  A  First  Look  at  Shrinkage  Methods 

We  describe  two  penalization  methods  that  are  used  in  the  context  of  multiple  linear 
regression,  ridge  regression  and  the  lasso. 

10.5.1  Ridge  Regression 

We  first  assume  that  y  has  been  centered  and  that  each  covariate  has  been 
standardized,  that  is, 


Consider  the  linear  model 


$$
y = x\beta + \epsilon
$$

with $x$ the $n \times k$ design matrix, $\beta = [\beta_1, \ldots, \beta_k]^T$ the $k \times 1$ vector of parameters,
and $E[\epsilon] = 0$, $\mathrm{var}(\epsilon) = \sigma^2 I$. Note that there is no intercept in the model due to the
centering of $y_1, \ldots, y_n$.

We saw in Chap. 5 that linear models are an analytically and computationally
appealing class but, with many predictors, fitting the full model without penalization
may result in wide predictive intervals, unless the sample size is very large relative
to $k$. Ridge regression is an approach to modeling that addresses this deficiency
by placing a particular form of constraint on the parameters. Specifically, $\hat{\beta}^{\mathrm{RIDGE}}$ is
chosen to minimize the penalized sum of squares:

$$
\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{k} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{k} \beta_j^2, \tag{10.20}
$$

for some $\lambda > 0$. Using a Lagrange multiplier argument (Exercise 10.6), minimization
of (10.20) is equivalent to minimization of

$$
\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{k} x_{ij}\beta_j \right)^2
$$

subject to, for some $s > 0$,

$$
\sum_{j=1}^{k} \beta_j^2 \leq s, \tag{10.21}
$$

so that the size of the sum of the squared coefficients is constrained (which is
known as an $L_2$ penalty). The intuition behind ridge regression is that, with many
parameters to estimate, the estimator can be highly variable, but by constraining the
sum of the squared coefficients, this shortcoming can be alleviated.

Figure 10.3 shows the effect of ridge regression with two parameters, $\beta_1$ and $\beta_2$.
The  elliptical  contours  in  the  top  right  of  the  figure  correspond  to  the  sum  of  squares. 
In  ridge  regression  this  sum  of  squares  is  minimized  subject  to  the  constraint  (10.21), 
and  for  k  =  2,  this  constraint  corresponds  to  a  circle,  centered  at  zero.  The  estimate 
is  given  by  the  point  at  which  the  ellipse  and  the  circle  touch. 

Writing the penalized sum of squares (10.20) as

$$
(y - x\beta)^T (y - x\beta) + \lambda \beta^T \beta, \tag{10.22}
$$

it is easy to see that the minimizing solution is

$$
\hat{\beta}^{\mathrm{RIDGE}} = (x^T x + \lambda I_k)^{-1} x^T Y. \tag{10.23}
$$

Since the estimator (10.23) is linear, it is straightforward to calculate the variance-covariance
matrix, for a given $\lambda$, as

$$
\mathrm{var}\left( \hat{\beta}^{\mathrm{RIDGE}} \right) = \sigma^2 (x^T x + \lambda I_k)^{-1} x^T x (x^T x + \lambda I_k)^{-1}. \tag{10.24}
$$

Fig. 10.3 Pictorial representation of ridge regression, for two covariates. The elliptical contours
represent the sum of squares, and the circle represents the constraint corresponding to the $L_2$ penalty

Beginning with a normal likelihood $y \mid \beta \sim N_n(x\beta, \sigma^2 I_n)$ and penalizing the
log-likelihood with a term proportional to $\lambda \beta^T \beta$ also leads to minimization of (10.22).
The resultant estimator (10.23) is therefore sometimes referred to as a maximum
penalized likelihood estimator (MPLE).

It is well known that the least squares estimator $\hat{\beta}^{\mathrm{LS}}$ is an unbiased estimator,
with variance $(x^T x)^{-1}\sigma^2$ (under correct second moment specification). If we write
$R = (x^T x)^{-1}$, then the ridge regression estimator may be written as (Exercise 10.7)

$$
\hat{\beta}^{\mathrm{RIDGE}} = (I_k + \lambda R)^{-1} \hat{\beta}^{\mathrm{LS}}, \tag{10.25}
$$

showing that it is clearly biased. Turning now to a consideration of the variance,
let $x = U D V^T$ be the singular value decomposition (SVD) of $x$. In the SVD, $U$
is $n \times n$, $V$ is $k \times k$, and $D$ is an $n \times k$ diagonal matrix with diagonal elements
$d_1, \ldots, d_k$. Then, the variance of the ridge estimator (10.24) may be written as

$$
\mathrm{var}\left( \hat{\beta}^{\mathrm{RIDGE}} \right) = \sigma^2 (x^T x + \lambda I_k)^{-1} x^T x (x^T x + \lambda I_k)^{-1} = \sigma^2 V A V^T, \tag{10.26}
$$

where $A$ is a diagonal matrix whose elements are $d_j^2 / (d_j^2 + \lambda)^2$. The variance of the
least squares estimator is

$$
\mathrm{var}\left( \hat{\beta}^{\mathrm{LS}} \right) = \sigma^2 V W V^T, \tag{10.27}
$$

where $W$ is a diagonal matrix whose elements are $1/d_j^2$. Since $d_j^2/(d_j^2 + \lambda)^2 < 1/d_j^2$
for $\lambda > 0$, the reduction in variance of the ridge regression estimator is apparent. The
derivations of (10.26) and (10.27) are left as Exercise 10.7.
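The closed forms (10.23) to (10.27) are easy to check numerically. The sketch below is illustrative rather than a reference implementation: the function names, the simulated design, and the values of $\lambda$ and $\sigma^2$ are assumptions. It computes the ridge estimate directly, via the least squares estimate as in (10.25), and compares the variance formula (10.24) with its SVD form (10.26).

```python
import numpy as np


def ridge_fit(x, y, lam):
    """Ridge estimate (x'x + lam I)^{-1} x'y, as in (10.23)."""
    k = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(k), x.T @ y)


def ridge_var(x, lam, sigma2):
    """Variance-covariance matrix of the ridge estimator, as in (10.24)."""
    k = x.shape[1]
    xtx = x.T @ x
    a_inv = np.linalg.inv(xtx + lam * np.eye(k))
    return sigma2 * a_inv @ xtx @ a_inv


def ridge_var_svd(x, lam, sigma2):
    """The same matrix via the SVD form sigma^2 V A V', as in (10.26)."""
    _, d, vt = np.linalg.svd(x, full_matrices=False)
    a = d**2 / (d**2 + lam) ** 2
    return sigma2 * (vt.T * a) @ vt  # scaling the columns of V by a gives V A


# Hypothetical data: standardized covariates and a centered response.
rng = np.random.default_rng(0)
n, k = 40, 3
x = rng.normal(size=(n, k))
x = (x - x.mean(axis=0)) / x.std(axis=0)
y = x @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)
y = y - y.mean()

lam, sigma2 = 5.0, 1.0
beta_ls = ridge_fit(x, y, 0.0)       # lam = 0 recovers least squares
beta_ridge = ridge_fit(x, y, lam)
r = np.linalg.inv(x.T @ x)
beta_via_ls = np.linalg.solve(np.eye(k) + lam * r, beta_ls)  # form (10.25)

print("least squares:", beta_ls)
print("ridge        :", beta_ridge)
print("var diag (LS, ridge):",
      np.diag(ridge_var(x, 0.0, sigma2)), np.diag(ridge_var(x, lam, sigma2)))
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse is a design choice for numerical stability when computing the estimate; the explicit inverse is used only where the variance formula requires the full matrix.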

With respect to the frequentist methods described in Chap. 2, penalized least
squares corresponds to a method that produces an estimating function with finite
sample bias but with potentially lower mean squared error as a consequence of the
penalization term, which reduces the variance.

For orthogonal covariates, $x^T x = n I_k$, the ridge regression estimator is

$$
\hat{\beta}^{\mathrm{RIDGE}} = \frac{n}{n + \lambda}\, \hat{\beta}^{\mathrm{LS}}.
$$

Hence, in this case, the ridge estimator always produces shrinkage towards 0.
Figure 10.4(a) illustrates the shrinkage (towards zero) performed by ridge regression
for a single parameter in the case of orthogonal covariates. For non-orthogonal
covariates, the collection of estimators undergoes shrinkage, though individual
components of $\hat{\beta}^{\mathrm{RIDGE}}$ may increase in absolute value.

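The fitted value at a particular value $x$ is

\begin{align}
\hat{f}(x) &= x \hat{\beta}^{\mathrm{RIDGE}} \tag{10.28} \\
&= x (x^T x + \lambda I_k)^{-1} x^T Y \tag{10.29}
\end{align}

with

$$
\mathrm{var}\left[ \hat{f}(x) \right] = \sigma^2 x (x^T x + \lambda I_k)^{-1} x^T x (x^T x + \lambda I_k)^{-1} x^T. \tag{10.30}
$$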


An important concept in shrinkage is the “effective” degrees of freedom associated
with a set of parameters. In a ridge regression setting, if we choose $\lambda = 0$, we
have $k$ parameters, while for $\lambda > 0$ the parameters are constrained and the degrees
of freedom will effectively be lower, tending to 0 as $\lambda \to \infty$. Many smoothers are
linear in the sense that $\hat{y} = S^{(\lambda)} y$, with ridge regression being one example, as can
be seen from (10.29). For linear smoothers, the effective (or equivalent) degrees of
freedom may be defined as

$$
p^{(\lambda)} = \mathrm{df}(\lambda) = \mathrm{tr}\left( S^{(\lambda)} \right), \tag{10.31}
$$

where the notation $S^{(\lambda)}$ emphasizes the dependence on the smoothing parameter.
For the ridge estimator, the effective degrees of freedom associated with estimation
of $\beta_1, \ldots, \beta_k$ is defined as

$$
\mathrm{df}(\lambda) = \mathrm{tr}\left[ x (x^T x + \lambda I_k)^{-1} x^T \right]. \tag{10.32}
$$

Notice that $\lambda = 0$, which corresponds to no shrinkage, gives $\mathrm{df}(\lambda) = k$ (so long as
$x^T x$ is non-singular), as we would expect.

There is a one-to-one mapping between $\lambda$ and the degrees of freedom, so in
practice, one may simply pick the effective degrees of freedom that one would like
associated with the fit and solve for $\lambda$. As an alternative to a user-chosen $\lambda$, a number
of automated methods for choosing $\lambda$ are described in Sect. 10.6.
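Solving for $\lambda$ given a target degrees of freedom can be sketched as follows. The sketch relies on the identity $\mathrm{tr}[x(x^Tx + \lambda I_k)^{-1}x^T] = \sum_j d_j^2/(d_j^2 + \lambda)$, where the $d_j$ are the singular values of $x$; this follows from substituting the SVD into (10.32), and the resulting function is decreasing in $\lambda$, so bisection applies. The function names, the bisection tolerance, and the simulated design are assumptions for illustration, and $x$ is assumed to have full column rank.

```python
import numpy as np


def ridge_df(x, lam):
    """Effective degrees of freedom (10.32) via the singular values of x."""
    d = np.linalg.svd(x, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))


def ridge_df_trace(x, lam):
    """The same quantity computed literally as the trace of the smoother matrix."""
    k = x.shape[1]
    s = x @ np.linalg.solve(x.T @ x + lam * np.eye(k), x.T)
    return float(np.trace(s))


def lambda_for_df(x, target_df, tol=1e-10):
    """Bisection for the lambda giving a requested effective degrees of freedom."""
    k = x.shape[1]
    if not 0.0 < target_df <= k:
        raise ValueError("target_df must lie in (0, k]")
    lo, hi = 0.0, 1.0
    while ridge_df(x, hi) > target_df:   # expand until the target is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if ridge_df(x, mid) > target_df:
            lo = mid                     # too little shrinkage, increase lambda
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical standardized design with k = 3 covariates.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 3))
x = (x - x.mean(axis=0)) / x.std(axis=0)

lam_2 = lambda_for_df(x, target_df=2.0)
print(f"lambda giving 2 effective degrees of freedom: {lam_2:.3f}")
```

For example, requesting 2 effective degrees of freedom from a three-covariate design returns the $\lambda$ at which the fit behaves, in this sense, like a two-parameter model.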

Insight into the ridge estimator can be gleaned from the following Bayesian
formulation. Consider the model with likelihood

$$
y \mid \beta, \sigma^2 \sim N_n(x\beta, \sigma^2 I_n), \tag{10.33}
$$

with $\sigma^2$ known, and prior

$$
\beta \mid \sigma^2 \sim N_k\left( 0, \frac{\sigma^2}{\lambda} I_k \right).
$$

The latter form shows that a large value of $\lambda$ corresponds to a prior that is more
tightly concentrated around zero and so leads to greater shrinkage of the collection
of coefficients towards zero. A common $\lambda$ for each $\beta_j$ makes it clear that we need
to standardize each of the covariates in order for them to be comparable.

Fig. 10.4 Comparison, for a single estimate, of different forms of shrinkage, with alternative
estimates plotted against the least squares estimate $\hat{\beta}^{\mathrm{LS}}$ and in the case of orthogonal covariates:
(a) ridge regression, (b) soft thresholding as carried out by the lasso, and (c) hard thresholding as
carried out by conventional variable selection. On all plots, the line of equality, representing the
unrestricted estimate, is drawn as dashed

Using derivations similar to those of Sect. 5.7, the posterior is

$$
\beta \mid y \sim N_k\left( \hat{\beta}^{\mathrm{RIDGE}}, \sigma^2 (x^T x + \lambda I_k)^{-1} \right),
$$

where $\hat{\beta}^{\mathrm{RIDGE}}$ corresponds to (10.23), confirming that the posterior mean and mode
coincide with the ridge regression estimator. Interestingly, the posterior
variance $\mathrm{var}(\beta \mid y)$ differs from $\mathrm{var}\left( \hat{\beta}^{\mathrm{RIDGE}} \right)$, as given in (10.24).

Fig. 10.5 Ridge estimates for the prostate data, as a function of the effective degrees of freedom


Example:  Prostate  Cancer 

As  described  in  Sect.  1.3.1,  the  response  in  this  dataset  is  log  (PSA),  and  there 
are  eight  covariates.  In  this  chapter,  we  take  the  aim  of  the  analysis  as  prediction  of 
log  PSA.  In  Chap.  5,  we  analyzed  these  data  using  a  Bayesian  approach  with  normal 
priors  for  each  of  the  eight  standardized  coefficients,  as  summarized  in  (5.66).  In  that 
case,  the  standard  deviation  of  the  normal  prior  was  chosen  on  substantive  grounds. 
Here,  we  illustrate  the  behavior  of  the  estimates  as  a  function  of  the  smoothing 
parameter. 

Figure 10.5 shows the eight ridge estimates as a function of the effective degrees
of freedom (which ranges between 0 and 8, because there is no intercept in the
model). For small values of $\lambda$, the effective degrees of freedom is close to 8, and
estimates show little shrinkage. In contrast, large values of $\lambda$ give effective degrees
of freedom close to 0 and strong shrinkage. Notice that the curves do not display
monotonic shrinkage due to the non-orthogonality of the covariates.

10.5.2  The  Lasso


The least absolute shrinkage and selection operator, or lasso, as described in
Tibshirani (1996),⁵ is a technique that has received a great deal of interest. As with
ridge regression, we assume that the covariates are standardized to have mean zero
and standard deviation 1. The lasso estimate minimizes the penalized sum of squares

$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{k} |\beta_j|, \tag{10.34}
$$

with respect to $\beta$. The $L_2$ penalty of ridge regression is therefore being replaced by
an $L_1$ penalty. As with ridge regression, the minimization of (10.34) can be shown
to be equivalent to minimization of

$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \right)^2 \tag{10.35}
$$

subject to

$$
\sum_{j=1}^{k} |\beta_j| \leq s, \tag{10.36}
$$

for some $s > 0$.

Let $\hat{\beta}^{\mathrm{LS}}$ and $\hat{\beta}^{\mathrm{LASSO}}$ denote the least squares and lasso estimates, respectively, and
define $s_0 = \sum_{j=1}^{k} |\hat{\beta}_j^{\mathrm{LS}}|$ as the $L_1$ norm of the least squares estimate. Values of
$s < s_0$ cause shrinkage of $\sum_{j=1}^{k} |\hat{\beta}_j^{\mathrm{LASSO}}|$ towards zero. If, for example, $s = s_0/2$,
then the average absolute shrinkage of the least squares coefficients is 50%, though
individual coefficients may increase rather than decrease in absolute value.

A  key  characteristic  of  the  lasso  is  that  individual  parameter  estimates  may  be 
set  to  zero,  a  phenomenon  that  does  not  occur  with  ridge  regression.  Figure  10.6 
gives the intuition behind this behavior in the case of two coefficients $\beta_1$ and $\beta_2$.
The lasso performs $L_1$ shrinkage so that there are “corners” in the constraint; the
diamond  represents  constraint  (10.36)  for  k  =  2.  If  the  ellipse  (10.35)  “hits”  one  of 
these  corners,  then  the  coefficient  corresponding  to  the  axis  that  is  touched  is  shrunk 
to  zero.  In  the  example  in  Fig.  10.6,  neither  of  the  coefficients  would  be  set  to  zero, 
because  the  ellipse  does  not  touch  a  corner.  As  k  increases,  the  multidimensional 
diamond  has  an  increasing  number  of  corners,  and  so  there  is  an  increasing  chance 
of  coefficients  being  set  to  zero.  Consequently,  the  lasso  effectively  produces  a  form 
of  continuous  subset  (or  feature)  selection.  The  lasso  is  sometimes  referred  to  as 
offering  a  sparse  solution  due  to  this  property  of  setting  coefficients  to  zero. 


⁵ The method was also introduced into the signal-processing literature, under the name basis
pursuit, by Chen et al. (1998).




Fig. 10.6 Pictorial representation of the lasso for two covariates. The elliptical contours
represent the sum of squares, and the diamond indicates the constraint corresponding to the
$L_1$ penalty

In the case of orthonormal covariates, for which $x^T x = I_k$, the lasso performs
so-called soft thresholding. Specifically, for component $j$ of the lasso estimator:

$$
\hat{\beta}_j^{\mathrm{LASSO}} = \mathrm{sign}\left( \hat{\beta}_j^{\mathrm{LS}} \right) \left( |\hat{\beta}_j^{\mathrm{LS}}| - \frac{\lambda}{2} \right)_+,
$$

where “sign” denotes the sign of its argument (±1), and $z_+$ represents the positive
part of $z$. As the smoothing parameter is varied, the sample path of the estimates
moves continuously to zero, as displayed in Fig. 10.4(b). In contrast, conventional
hypothesis testing performs hard thresholding, as illustrated in Fig. 10.4(c), since
the coefficient is set equal to zero when the absolute value of the estimate drops
below some critical value, giving discontinuities in the graph.
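The three forms of shrinkage compared in Fig. 10.4 can be written as simple functions of the least squares estimate. The sketch below is illustrative: the function names and the grid of values are assumptions. The ridge map uses the orthogonal-covariate form with $x^Tx = nI_k$; the soft-thresholding map uses the threshold $\lambda/2$, which is what minimizing (10.34) gives under orthonormal covariates; and the hard-thresholding cutoff is a hypothetical critical value.

```python
import numpy as np


def ridge_shrink(beta_ls, lam, n):
    """Ridge under orthogonal covariates (x'x = n I): proportional shrinkage."""
    return n / (n + lam) * np.asarray(beta_ls, dtype=float)


def soft_threshold(beta_ls, lam):
    """Lasso under orthonormal covariates: shift towards zero by lam/2, then clip."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)


def hard_threshold(beta_ls, cutoff):
    """Variable selection: keep the estimate if it exceeds the cutoff, else zero."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    return np.where(np.abs(beta_ls) > cutoff, beta_ls, 0.0)


beta_ls = np.array([-3.0, -0.4, 0.0, 0.4, 3.0])
print("ridge:", ridge_shrink(beta_ls, lam=10.0, n=10))
print("soft :", soft_threshold(beta_ls, lam=1.0))
print("hard :", hard_threshold(beta_ls, cutoff=0.5))
```

Ridge rescales every estimate but never sets one to zero; soft thresholding sets small estimates to zero and moves the rest continuously; hard thresholding leaves large estimates untouched, which is the source of the jump in the corresponding panel.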

The lasso solution is nonlinear in $y$. Efficient algorithms based on coordinate
descent exist for computation, however; see Meier et al. (2008) and Wu and Lange
(2008). Tibshirani (2011) gives a brief history of the computation of the lasso
solution. Due to the nonlinearity of the solution and the subset selection nature of
estimation, inference is not straightforward and remains an open problem. Standard
errors for elements of $\hat{\beta}^{\mathrm{LASSO}}$ are not immediately available, though they may be
calculated via the bootstrap. Since the lasso estimator is not linear, the effective
degrees of freedom cannot be defined as in (10.31); an alternative definition exists
as

$$
\mathrm{df} = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{cov}(\hat{y}_i, y_i),
$$

see Hastie et al. (2009), equation (3.60).
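To give a sense of how coordinate descent works for the lasso, the following is a minimal sketch; it is not the algorithm of any of the cited papers, and the function names, stopping rule, and simulated data are assumptions. It minimizes the criterion (10.34) without an intercept (the response is assumed centered), cycling through the coefficients and applying a soft-threshold update to each in turn, with threshold $\lambda/2$ because the sum of squares in (10.34) carries no factor of one half.

```python
import numpy as np


def soft(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(x, y, lam, n_iter=1000, tol=1e-10):
    """Cyclic coordinate descent for sum((y - x b)^2) + lam * sum(|b_j|)."""
    n, k = x.shape
    beta = np.zeros(k)
    col_ss = np.sum(x**2, axis=0)        # x_j' x_j for each column
    resid = y - x @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(k):
            # Inner product of x_j with the partial residual excluding covariate j.
            z = x[:, j] @ resid + col_ss[j] * beta[j]
            new = soft(z, lam / 2.0) / col_ss[j]
            resid += x[:, j] * (beta[j] - new)   # keep the residual in step
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta


# Hypothetical data: standardized covariates, centered response.
rng = np.random.default_rng(0)
n, k = 50, 4
x = rng.normal(size=(n, k))
x = (x - x.mean(axis=0)) / x.std(axis=0)
y = x @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()

beta_lasso = lasso_cd(x, y, lam=40.0)
print("lasso estimate:", beta_lasso)
```

With $\lambda = 0$ the updates reduce to coordinate-wise least squares, and for $\lambda \geq 2\max_j |x_j^T y|$ every coefficient stays at zero, which gives two simple checks on an implementation of this kind.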

More generally, penalties of the form

$$
\lambda \sum_{j=1}^{k} |\beta_j|^q
$$

may be considered, for $q > 0$. Ridge regression and the lasso correspond to $q = 2$
and $q = 1$, respectively. For $q < 1$, the constraint is non-convex, which makes
optimization more difficult. Convex penalties occur for $q \geq 1$ and feature selection
for $q \leq 1$, so that the lasso (with $q = 1$) achieves both.

Many  variants  of  the  lasso  have  appeared  since  its  introduction  (Tibshirani  2011). 
In  some  contexts,  we  may  wish  to  treat  a  set  of  regressors  as  a  group,  for  example, 
when  we  have  a  categorical  covariate  with  more  than  two  levels.  The  grouped 
lasso  (Yuan  and  Lin  2007)  addresses  this  problem  by  considering  the  simultaneous 
shrinkage  of  (pre-defined)  groups  of  coefficients. 

In the case in which $k > n$, the lasso cannot select more than $n$ variables.
Furthermore, the lasso will typically assign only one nonzero coefficient to a
set of highly correlated covariates (Zou and Hastie 2005), which is an obvious
disadvantage and was a motivation for the group lasso (Yuan and Lin 2007).
Empirical observation indicates that the lasso produces inferior performance to
ridge regression when there are a large number of small effects (Tibshirani 1996).
These deficiencies motivated the elastic net (Zou and Hastie 2005) which attempts
to combine the desirable properties of ridge regression and the lasso via a penalty
of the form

$$
\lambda_1 \sum_{j=1}^{k} |\beta_j| + \lambda_2 \sum_{j=1}^{k} \beta_j^2.
$$

The lasso estimate is equivalent to the mode of the posterior distribution under a
normal likelihood, (10.33), and independent Laplace (double exponential) priors on
elements of $\beta$:

$$
p(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|)
$$

for $j = 1, \ldots, k$ (the variance of this distribution is $2/\lambda^2$, Appendix D). Under this
prior, the posterior is not available in closed form, and the posterior mean will not
equal the posterior mode. Hence, if used as a summary, the posterior means will
not produce the same lasso shrinkage of coefficients to zero. Thus, regardless of
the value of $\lambda$, all $k$ covariates are retained in a Bayesian analysis, even though the
posterior mode may lie at zero. Markov chain Monte Carlo allows inference under
the normal/Laplace model but without the subset selection aspect, which lessens the
appeal of this Bayesian version of the lasso.


Example:  Prostate  Cancer 


We illustrate the use of the lasso for the prostate cancer data. Figure 10.7 shows the
lasso estimates as a function of the shrinkage factor:

$$
\frac{\sum_{j=1}^{k} |\hat{\beta}_j^{\mathrm{LASSO}}|}{\sum_{j=1}^{k} |\hat{\beta}_j^{\mathrm{LS}}|}.
$$

Fig. 10.7 Lasso estimates for the prostate data, as a function of the shrinkage factor,
$\sum_{j=1}^{k} |\hat{\beta}_j^{\mathrm{LASSO}}| / \sum_{j=1}^{k} |\hat{\beta}_j^{\mathrm{LS}}|$

When the shrinkage factor is 1, the lasso estimates are the same as the least squares
estimates. Beginning with the coefficient associated with log capsular penetration
and ending with that associated with log cancer volume, each of the coefficients
is absorbed at zero as the coefficient trajectories are traced out. For example, at
a shrinkage factor of 0.4, only 3 coefficients are nonzero, those associated with
log cancer volume, log weight and Gleason. In this example, the curves decrease
monotonically to zero, but this phenomenon will not occur in all examples. The
piecewise linear nature of the solution is apparent.


10.6  Smoothing  Parameter  Selection 


For both ridge regression and the lasso, as well as a number of methods to be
described in Chaps. 11 and 12, a key element of implementation is smoothing
parameter selection.⁶ We denote a generic smoothing parameter by $\lambda$ and the
estimated function at this $\lambda$, for a particular covariate value $x$, by $\hat{f}^{(\lambda)}(x)$.


⁶ We use the name “smoothing” parameter because we concentrate on nonparametric regression
smoothers in this and the next two chapters, but in the context of ridge regression and the lasso, the
label “tuning” parameter is often used.


In this section, the overall strategy is to derive methods for minimizing, with
respect to $\lambda$, estimates of the generalization error, or related measures. We initially
assume a quadratic loss function before describing smoothing parameter selection
in generalized linear model situations.

In Sect. 10.6.1, an analytic method of minimizing the AMSE (Table 10.2) is
described and shown to be equivalent to Mallows' $C_p$ (Sect. 4.8.2). Two popular
approaches for smoothing parameter selection, ordinary and generalized
cross-validation, are described in Sects. 10.6.2 and 10.6.3, and in Sect. 10.6.4, we describe
the AIC model selection statistic, which extends Mallows' $C_p$ to general data types.
Finally, Sect. 10.6.5 briefly describes cross-validation for generalized linear models.

Bayesian approaches include choosing $\lambda$ on substantive grounds (as carried out
in Sect. 5.12) or treating $\lambda$ as an unknown parameter. In the latter case, a prior is
specified for $\lambda$, which is then estimated in the usual way. Section 11.2.8 adopts
a mixed model formulation and describes a frequentist approach to smoothing
parameter estimation, with restricted maximum likelihood (REML, see Sect. 8.5.3)
being emphasized. Section 11.2.9 takes the same formulation but describes a
Bayesian approach to estimation.

Smoothing  parameter  choice  is  an  inherently  difficult  problem  because,  in  many 
situations, the data do not indicate a clear “optimal” $\lambda$. Therefore, there is no
universally  reliable  method  for  smoothing  parameter  selection.  Consequently,  in 
practice,  one  should  not  blindly  accept  the  solution  provided  by  any  method.  Rather, 
one  should  treat  the  solution  as  a  starting  point  for  further  exploration,  including  the 
use  of  alternative  methods. 


10.6.1 Mallows' $C_p$

In this section we assume that the smoothing method produces a linear smoother of
the form $\hat{y} = S^{(\lambda)} y$. Ridge regression provides an example with
$S^{(\lambda)} = x (x^T x + \lambda I_p)^{-1} x^T$; the lasso does not fall within this class.
Many methods that we describe in Chap. 11 produce smoothers of linear form.

Recall, from Sect. 5.11.2, that in linear regression $\hat{y} = S y$ where
$S = x (x^T x)^{-1} x^T$ is the hat matrix, and $\mathrm{tr}(S)$ is both the number of regression
parameters in the model and the degrees of freedom. Equation (10.31) defined the
effective degrees of freedom for linear smoothers as
$p^{(\lambda)} = \mathrm{df}(\lambda) = \mathrm{tr}(S^{(\lambda)})$. One
approach to smoothing parameter choice is to simply pick $\lambda$ to produce the desired
effective degrees of freedom $p^{(\lambda)}$, if we have some a priori sense of the degrees of
freedom that is desirable. This allows a direct comparison with parametric models.
For example, one may pick $p^{(\lambda)} = 4$ to provide a fit with effective degrees of
freedom equal to the number of parameters in a cubic polynomial regression model.
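The idea of picking $\lambda$ to hit a target effective degrees of freedom can be sketched numerically. The following is a minimal illustration for ridge regression, assuming a simulated design matrix (not data from this book); the helper names `ridge_smoother`, `effective_df`, and `lambda_for_df` are hypothetical, and the bisection search is just one convenient way to invert $\mathrm{tr}(S^{(\lambda)})$.

```python
import numpy as np

def ridge_smoother(x, lam):
    """Return S = x (x'x + lam I)^{-1} x' for an n-by-p design matrix x."""
    p = x.shape[1]
    return x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)

def effective_df(x, lam):
    """Effective degrees of freedom tr(S) of the ridge smoother."""
    return np.trace(ridge_smoother(x, lam))

def lambda_for_df(x, target_df, lo=1e-8, hi=1e8, iters=200):
    """Bisection on the log scale; tr(S) decreases monotonically in lam."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if effective_df(x, mid) > target_df:
            lo = mid          # too many degrees of freedom: smooth more
        else:
            hi = mid          # too few degrees of freedom: smooth less
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))          # simulated design, an assumption
lam4 = lambda_for_df(x, 4.0)          # lambda giving about 4 effective df
df_at_zero = effective_df(x, 0.0)     # no penalty: df equals number of columns
df_at_lam4 = effective_df(x, lam4)
```

With no penalty the trace recovers the number of columns of the design, and the search returns a $\lambda$ whose trace matches the requested value, mirroring the comparison with a four-parameter cubic polynomial.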

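An appealing approach is to choose the smoothing parameter to minimize the
average mean squared error, (10.17):

$$\mathrm{AMSE}(\lambda) = \mathrm{AMSE}(\hat{f}^{(\lambda)}) = \frac{1}{n} \sum_{i=1}^{n} E\left\{ \left[ f(x_i) - \hat{f}^{(\lambda)}(x_i) \right]^2 \right\} = \frac{1}{n} E\left[ (f - \hat{f}^{(\lambda)})^T (f - \hat{f}^{(\lambda)}) \right], \qquad (10.37)$$

where $f = [f(x_1), \dots, f(x_n)]^T$ and
$\hat{f}^{(\lambda)} = [\hat{f}^{(\lambda)}(x_1), \dots, \hat{f}^{(\lambda)}(x_n)]^T$. The
AMSE depends on the unknown $f$ and so is not directly of use. A more applicable
version is obtained by replacing $f$ by $Y - \epsilon$ (with $E[\epsilon] = 0$) and taking
$\hat{f}^{(\lambda)} = S^{(\lambda)} Y$ to give

$$\mathrm{AMSE}(\lambda) = \frac{1}{n} E\left[ (Y - \epsilon - S^{(\lambda)} Y)^T (Y - \epsilon - S^{(\lambda)} Y) \right]$$

$$= \frac{1}{n} E\left[ (Y - S^{(\lambda)} Y)^T (Y - S^{(\lambda)} Y) \right] + \frac{1}{n} E[\epsilon^T \epsilon] - \frac{1}{n} E\left[ 2 \epsilon^T (I - S^{(\lambda)}) Y \right].$$

Replacing $Y$ by $f + \epsilon$ in the final term and rearranging gives

$$\mathrm{AMSE}(\lambda) = \frac{1}{n} E\left[ (Y - S^{(\lambda)} Y)^T (Y - S^{(\lambda)} Y) \right] - \frac{1}{n} E\left[ \epsilon^T \epsilon + 2 \epsilon^T f - 2 \epsilon^T S^{(\lambda)} f - 2 \epsilon^T S^{(\lambda)} \epsilon \right].$$

Since

$$E\left[ \epsilon^T S^{(\lambda)} \epsilon \right] = E\left[ \mathrm{tr}\left( \epsilon^T S^{(\lambda)} \epsilon \right) \right] = E\left[ \mathrm{tr}\left( S^{(\lambda)} \epsilon \epsilon^T \right) \right] = \mathrm{tr}\left( S^{(\lambda)} I \sigma^2 \right) = \sigma^2 \mathrm{tr}\left( S^{(\lambda)} \right) = \sigma^2 p^{(\lambda)},$$

and $E[2 \epsilon^T f] = E[2 \epsilon^T S^{(\lambda)} f] = 0$, we obtain

$$\mathrm{AMSE}(\lambda) = \frac{1}{n} E\left[ (Y - S^{(\lambda)} Y)^T (Y - S^{(\lambda)} Y) \right] - \sigma^2 + \frac{2}{n} E\left[ \epsilon^T S^{(\lambda)} \epsilon \right]$$

$$= \frac{1}{n} E\left[ (Y - S^{(\lambda)} Y)^T (Y - S^{(\lambda)} Y) \right] - \sigma^2 + \frac{2 p^{(\lambda)}}{n} \sigma^2. \qquad (10.38)$$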


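The natural estimator of (10.38) is

$$\widehat{\mathrm{AMSE}}(\lambda) = \frac{1}{n} (Y - S^{(\lambda)} Y)^T (Y - S^{(\lambda)} Y) - \hat{\sigma}^2_{\max} + \frac{2 p^{(\lambda)}}{n} \hat{\sigma}^2_{\max} = \frac{\mathrm{RSS}(\lambda)}{n} - \frac{\hat{\sigma}^2_{\max}}{n} \left( n - 2 p^{(\lambda)} \right),$$

where $\hat{\sigma}^2_{\max} > 0$ is an estimate from a maximal model (e.g., the full model in
a regression setting). Minimizing the estimated $\mathrm{AMSE}(\lambda)$ as a function of $\lambda$ is
therefore equivalent to minimization of Mallows' $C_p$ statistic, (4.25):

$$\frac{\mathrm{RSS}(\lambda)}{\hat{\sigma}^2_{\max}} - \left( n - 2 p^{(\lambda)} \right). \qquad (10.39)$$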

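A useful quantity to evaluate is the average predictive risk (APR, Table 10.2), which
is the predictive risk at the observed $x_i$, $i = 1, \dots, n$. Specifically,

$$\mathrm{APR} = \sigma^2 + \mathrm{AMSE}, \qquad (10.40)$$

which can be estimated by

$$\widehat{\mathrm{APR}}(\lambda) = \hat{\sigma}^2_{\max} + \frac{\mathrm{RSS}(\lambda)}{n} - \frac{\hat{\sigma}^2_{\max}}{n} \left( n - 2 p^{(\lambda)} \right) = \frac{\mathrm{RSS}(\lambda)}{n} + \frac{2 p^{(\lambda)}}{n} \hat{\sigma}^2_{\max}. \qquad (10.41)$$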


Estimating  APR  by  the  average  residual  sum  of  squares  (i.e.  the  first  term  in  (10.41)) 
is  clearly  subject  to  overfitting  (and  hence  will  be  an  underestimate),  but  this  is 
corrected  for  by  the  second  term. 
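As a rough numerical check of the relationship between (10.39) and (10.41), the sketch below evaluates both criteria for ridge regression over a grid of $\lambda$ values. It assumes simulated data with no intercept, takes $\hat{\sigma}^2_{\max}$ from the unpenalized least squares fit, and uses the hypothetical helper `ridge_smoother`; none of these choices come from the text.

```python
import numpy as np

def ridge_smoother(x, lam):
    """S = x (x'x + lam I)^{-1} x'."""
    p = x.shape[1]
    return x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)

rng = np.random.default_rng(1)
n, p = 60, 8
x = rng.normal(size=(n, p))                          # simulated design
beta = np.array([2.0, -1.0, 0.5, 0, 0, 0, 0, 0])
y = x @ beta + rng.normal(size=n)

# Error variance estimate from the maximal (full least squares) model
rss_full = np.sum((y - ridge_smoother(x, 0.0) @ y) ** 2)
sigma2_max = rss_full / (n - p)

lams = np.exp(np.linspace(np.log(1e-3), np.log(1e3), 61))
cp, apr_hat = [], []
for lam in lams:
    s = ridge_smoother(x, lam)
    rss = np.sum((y - s @ y) ** 2)
    df = np.trace(s)                                  # p^(lambda)
    cp.append(rss / sigma2_max - (n - 2 * df))        # Mallows' C_p, (10.39)
    apr_hat.append(rss / n + 2 * df * sigma2_max / n) # estimated APR, (10.41)
cp, apr_hat = np.array(cp), np.array(apr_hat)
lam_cp = lams[np.argmin(cp)]
```

Because the estimated APR equals $\hat{\sigma}^2_{\max}(C_p/n + 1)$, the two curves differ only by a positive affine transformation and so share a minimizer.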


10.6.2  K-Fold  Cross-Validation 


A  widely  used  and  simple  method  for  estimating  prediction  error,  and  hence 
smoothing  parameters,  is  cross-validation.  If  we  try  to  estimate  the  APR,  as  given 
by (10.40), from the data directly, that is, using

$$\frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \hat{f}^{(\lambda)}(x_i) \right]^2 = \frac{\mathrm{RSS}(\lambda)}{n},$$


we  will  obtain  an  optimistic  estimate  because  the  data  have  been  used  twice:  once 
to  fit  the  model  and  once  to  estimate  the  predictive  risk,  as  we  saw  in  (10.41). 
The  problem  is  that  the  idiosyncrasies  of  the  particular  realization  of  the  data  will 
influence  coefficient  estimates  so  that  the  model  will,  in  turn,  predict  the  data 
“too  well”.  As  noted  in  Sect.  10.4,  ideally  one  would  split  the  data  to  produce 
a  validation  dataset,  with  estimation  of  the  generalization  error  being  performed 
using  the  validation  data.  Unfortunately  there  are  frequently  insufficient  data  to  carry 
out  this  step.  However,  cross-validation  provides  an  approach  in  the  same  spirit  to 
estimate  the  APR. 

In $K$-fold cross-validation, a fraction $(K - 1)/K$ of the data are used to fit the model.
The remaining fraction, $1/K$, are predicted, and these data are used to produce
an estimate of the predictive risk. Let $y = [y_1, \dots, y_K]$ represent a particular
$K$-fold split of the $n \times 1$ data vector $y$. Further, let $J(k)$ be the set of elements
of $\{1, 2, \dots, n\}$ that correspond to the indices of data points within split $k$, with
$n_k = |J(k)|$ representing the cardinality of set $k$. Let $y_{-k}$ be the data with the
portion $y_k$ removed and $\hat{f}_{-k}^{(\lambda)}(x_i)$ represent the $i$-th fitted value, computed from
fitting a model using $y_{-k}$. Cross-validation proceeds by cycling over $k = 1, \dots, K$
through the following two steps:

1. Fit the model using $y_{-k}$.

2. Use the fitted model to obtain predictions for the removed data, $y_k$, and estimate
the error as

$$\mathrm{CV}_k^{(\lambda)} = \frac{1}{n_k} \sum_{i \in J(k)} \left[ y_i - \hat{f}_{-k}^{(\lambda)}(x_i) \right]^2. \qquad (10.42)$$


The $K$ prediction errors are averaged to give

$$\mathrm{CV}^{(\lambda)} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{CV}_k^{(\lambda)}.$$

This procedure is repeated for each potential value of the smoothing parameter, $\lambda$.
We emphasize that the data are split into $K$ pieces once, and so the resultant datasets
are the same across all $\lambda$.
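The two-step cycle and the final averaging can be sketched as follows for ridge regression. This is an illustrative implementation only: the data are simulated, the function names `ridge_fit` and `kfold_cv` are hypothetical, and no intercept or standardization is included. Note that the folds are created once and reused for every $\lambda$, as emphasized above.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Ridge coefficients (x'x + lam I)^{-1} x'y."""
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

def kfold_cv(x, y, lams, K=5, seed=0):
    """Return CV^(lambda) for each lambda in lams, using one fixed split."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)    # split once, reused for all lam
    cv = np.zeros(len(lams))
    for j, lam in enumerate(lams):
        for idx in folds:
            train = np.setdiff1d(np.arange(n), idx)
            b = ridge_fit(x[train], y[train], lam)           # step 1: fit on y_{-k}
            cv_k = np.mean((y[idx] - x[idx] @ b) ** 2)       # step 2: CV_k, (10.42)
            cv[j] += cv_k / K                                # average over folds
    return cv

rng = np.random.default_rng(5)
n, p = 100, 8
x = rng.normal(size=(n, p))                                  # simulated data
y = x @ np.array([2.0, -1.0, 0.5, 0, 0, 0, 0, 0]) + rng.normal(size=n)
lams = 10.0 ** np.arange(-2, 7)
cv = kfold_cv(x, y, lams)
lam_hat = lams[np.argmin(cv)]
```

Setting `K` equal to `n` in this sketch gives leave-one-out cross-validation, at the cost of `n` model fits per value of $\lambda$.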

Typical choices for $K$ include 5, 10, and $n$, the latter being known as leave-one-out
or ordinary cross-validation (OCV). Picking $K = n$ produces an estimate of the
expected prediction error with the least bias, but this estimate can have high variance
because the $n$ training sets are so similar to one another. The computational burden
of OCV can be heavy, though for a large class of smoothers this burden can be
sidestepped, as we describe shortly. For smaller values of $K$, the variance of the expected
prediction error estimator is smaller but there is greater bias. Breiman and Spector
(1992) provide some discussion on choice of $K$ and recommend $K = 5$ based on
simulations in which the aim was subset selection. A number of authors (e.g., Hastie
et al. 2009) routinely create an estimate of the standard error of the cross-validation
score, (10.42). This estimate assumes independence of $\mathrm{CV}_k^{(\lambda)}$, $k = 1, \dots, K$, which
is clearly not true since each pair of training sets shares a proportion $1 - 1/(K - 1)$ of its
observations.

We consider leave-one-out cross-validation in more detail. It would appear that
we need to fit the model $n$ times, but we show that, for a particular class of smoothers
(to be described below),

$$\frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \hat{f}_{-i}^{(\lambda)}(x_i) \right]^2 = \frac{1}{n} \sum_{i=1}^{n} \frac{\left[ y_i - \hat{f}^{(\lambda)}(x_i) \right]^2}{\left( 1 - S_{ii}^{(\lambda)} \right)^2}, \qquad (10.43)$$

where $S_{ii}^{(\lambda)}$ is the $i$th diagonal element of $S^{(\lambda)}$, and $\hat{f}_{-i}^{(\lambda)}(x_i)$ is the $i$th fitted point,
based on $y_{-i}$.

We prove (10.43), for a particular class of smoothers, based on the derivation in
Wood (2006, Sect. 4.5.2). For many smoothing methods, including ridge regression,
we can write the model as $f = h\beta$ where $h = [h_1, \dots, h_n]^T$ is an $n \times J$ design
matrix with $h_i$ a $1 \times J$ vector, and $\beta$ is a $J \times 1$ vector of parameters. We prove the
result (10.43) for a class of problems involving minimization of a sum of squares
plus a quadratic penalty term:

$$\sum_{i=1}^{n} (y_i - h_i \beta)^2 + \lambda \beta^T D \beta,$$

for a known matrix $D$. Section 11.2.5 gives further examples of smoothers that fall
within this class. Fitting the model to the $n - 1$ points contained in $y_{-i}$ involves
minimization of

$$\sum_{j=1, j \neq i}^{n} \left[ y_j - \hat{f}_{-i}^{(\lambda)}(x_j) \right]^2 + \lambda \beta^T D \beta = \sum_{j=1}^{n} \left[ y_j^* - \hat{f}_{-i}^{(\lambda)}(x_j) \right]^2 + \lambda \beta^T D \beta \qquad (10.44)$$

where

$$y_j^* = \begin{cases} y_j & \text{if } j \neq i, \\ y_i - y_i + \hat{f}_{-i}^{(\lambda)}(x_i) & \text{if } j = i. \end{cases}$$

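Minimization of (10.44) yields

$$\hat{f}_{-i}^{(\lambda)} = S^{(\lambda)} y^* = h (h^T h + \lambda D)^{-1} h^T y^*,$$

and $\hat{f}_{-i}^{(\lambda)}(x_i) = S_i^{(\lambda)} y^*$, where $S_i^{(\lambda)}$ is the $i$th row of $S^{(\lambda)}$ and
$y^* = [y_1^*, \dots, y_n^*]^T$. Now

$$\hat{f}_{-i}^{(\lambda)}(x_i) = S_i^{(\lambda)} y^* = S_i^{(\lambda)} y - S_{ii}^{(\lambda)} y_i + S_{ii}^{(\lambda)} \hat{f}_{-i}^{(\lambda)}(x_i) = \hat{f}^{(\lambda)}(x_i) - S_{ii}^{(\lambda)} y_i + S_{ii}^{(\lambda)} \hat{f}_{-i}^{(\lambda)}(x_i),$$

so that

$$\hat{f}_{-i}^{(\lambda)}(x_i) = \frac{\hat{f}^{(\lambda)}(x_i) - S_{ii}^{(\lambda)} y_i}{1 - S_{ii}^{(\lambda)}}$$

and

$$y_i - \hat{f}_{-i}^{(\lambda)}(x_i) = \frac{y_i \left( 1 - S_{ii}^{(\lambda)} \right) - \hat{f}^{(\lambda)}(x_i) + S_{ii}^{(\lambda)} y_i}{1 - S_{ii}^{(\lambda)}} = \frac{y_i - \hat{f}^{(\lambda)}(x_i)}{1 - S_{ii}^{(\lambda)}}, \qquad (10.45)$$

as required. To calculate the leave-one-out CV score, we therefore need only the
residuals from the fit to the complete data and the diagonal elements of the smoother
matrix. Note that the effect of $1 - S_{ii}^{(\lambda)}$ in the denominator of (10.43) is to inflate
the residual at the $i$th point, hence accounting for the underestimation of simply
using the residual sum of squares. Formula (10.43) is true for all linear smoothers.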
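The identity (10.43) can be checked numerically. The sketch below does so for ridge regression (the case $D = I$ with a fixed $\lambda$), using simulated data and comparing the single-fit shortcut with a brute-force loop over the $n$ leave-one-out fits; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 5, 3.0
x = rng.normal(size=(n, p))                       # simulated design
y = x @ rng.normal(size=p) + rng.normal(size=n)

# Shortcut: one fit to the full data, then inflate each residual by 1/(1 - S_ii)
s = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
resid = y - s @ y
ocv_shortcut = np.mean((resid / (1 - np.diag(s))) ** 2)

# Brute force: refit n times, each time predicting the omitted point
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.solve(x[keep].T @ x[keep] + lam * np.eye(p),
                        x[keep].T @ y[keep])
    loo[i] = y[i] - x[i] @ b
ocv_brute = np.mean(loo ** 2)
```

The two quantities agree to floating-point accuracy here, which is why the OCV score for this class of smoothers costs no more than a single fit.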

In practice, curves of the estimated prediction error against $\lambda$ (the smoothing
parameter) can be very flat, as shown for instance in Fig. 10.9. Therefore, as already
noted, simply blindly using the value of $\lambda$ that minimizes the cross-validation sum
of squares is not a reliable strategy. In Hastie et al. (2009), it is recommended that $\lambda$
be chosen such that the prediction error is no greater than one standard error above
that with the lowest error. This approach results in a more parsimonious model being
selected, though this recommendation is based on judgement and experience rather
than theory.
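The one-standard-error recommendation can be sketched as below. This is one possible reading of the rule, under the assumptions that larger $\lambda$ corresponds to a simpler model and that the standard error is computed across folds as if the fold errors were independent; the function name `one_se_rule` and the toy fold errors are hypothetical.

```python
import numpy as np

def one_se_rule(lams, fold_errors):
    """lams: increasing array of length L; fold_errors: K-by-L array of CV_k values.

    Returns (lambda at the minimum CV error,
             largest lambda whose CV error is within one SE of that minimum).
    """
    lams = np.asarray(lams, dtype=float)
    fold_errors = np.asarray(fold_errors, dtype=float)
    K = fold_errors.shape[0]
    mean = fold_errors.mean(axis=0)
    se = fold_errors.std(axis=0, ddof=1) / np.sqrt(K)  # treats folds as independent
    j_min = int(np.argmin(mean))
    threshold = mean[j_min] + se[j_min]
    eligible = np.flatnonzero(mean <= threshold)
    return lams[j_min], lams[eligible.max()]

# Toy fold errors for K = 2 folds and four candidate values of lambda
lams = [0.1, 1.0, 10.0, 100.0]
fold_errors = [[1.2, 0.9, 1.0, 2.0],
               [1.4, 1.1, 1.1, 2.2]]
lam_min, lam_1se = one_se_rule(lams, fold_errors)
```

In this toy case the minimum is at $\lambda = 1$, but $\lambda = 10$ has an error within one standard error of the minimum and so would be preferred under the rule.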


10.6.3  Generalized  Cross-Validation 

So-called generalized cross-validation (GCV) provides an alternative to $K$-fold
cross-validation. The GCV score is

$$\mathrm{GCV}(\lambda) = \frac{n \sum_{i=1}^{n} \left[ y_i - \hat{f}^{(\lambda)}(x_i) \right]^2}{\left[ n - \mathrm{tr}\left( S^{(\lambda)} \right) \right]^2} \qquad (10.46)$$

for a linear smoother $\hat{y} = S^{(\lambda)} y$. An important early reference on the use of
GCV is Craven and Wahba (1979). Recall that $\mathrm{tr}(S^{(\lambda)})$ is the effective degrees
of freedom of a linear smoother, (10.31), with larger values of $\lambda$ corresponding to
increased smoothing. Therefore, the denominator of (10.46) is the squared effective
residual degrees of freedom and a measure of complexity: increasing $\lambda$ decreases
the effective number of parameters, that is, the complexity of the model, and this
reduction produces lower variability. However, the numerator is the residual sum of
squares and as such is a measure of squared bias with larger $\lambda$ giving a poorer fit
and increased bias. Consequently, we see that the GCV score is providing a trade-off
between bias and variance. Unlike $K$-fold cross-validation, GCV does not require
splitting of the data into cross-validation folds and repeatedly training and testing
the model.

GCV may be justified/motivated in a number of different ways. On computational
grounds, the GCV score is simpler to evaluate than the OCV score, since one only
needs the trace of $S^{(\lambda)}$ and not the diagonal elements $S_{ii}^{(\lambda)}$. Recall from Sect. 5.11.2
that in the context of a linear model, the leverage of $y_i$ is defined as $S_{ii}$, and so
the OCV score can be highly influenced by a small number of data points (due to
the presence of $1 - S_{ii}^{(\lambda)}$ in the denominator of (10.43)), which can be undesirable.
Therefore, one interpretation of GCV is that it is simply a robust alternative to OCV
with $1 - S_{ii}^{(\lambda)}$ replaced by $1 - \mathrm{tr}(S^{(\lambda)})/n$, which is clear if we rewrite (10.46) as

$$\mathrm{GCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left[ y_i - \hat{f}^{(\lambda)}(x_i) \right]^2}{\left( 1 - S_{ii}^{(\lambda)} \right)^2} \left[ \frac{1 - S_{ii}^{(\lambda)}}{1 - \mathrm{tr}\left( S^{(\lambda)} \right)/n} \right]^2 = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \hat{f}_{-i}^{(\lambda)}(x_i) \right]^2 \left[ \frac{1 - S_{ii}^{(\lambda)}}{1 - \mathrm{tr}\left( S^{(\lambda)} \right)/n} \right]^2.$$

This representation illustrates that those observations with large leverage are being
down-weighted, as compared to OCV.
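A short numerical sketch of the OCV and GCV scores for ridge regression follows. It takes the GCV score to be $n\,\mathrm{RSS}(\lambda)/[n - \mathrm{tr}(S^{(\lambda)})]^2$, which is the form implied by the description of the numerator and denominator above, and checks that this agrees with the leverage-reweighted OCV sum. The data are simulated and the function name `scores` is hypothetical.

```python
import numpy as np

def scores(x, y, lam):
    """Return (OCV, GCV, GCV written as a re-weighted OCV sum) for ridge."""
    n, p = x.shape
    s = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
    resid = y - s @ y
    h = np.diag(s)                                   # leverages S_ii
    tr = np.trace(s)
    ocv = np.mean((resid / (1 - h)) ** 2)
    gcv = n * np.sum(resid ** 2) / (n - tr) ** 2
    weights = ((1 - h) / (1 - tr / n)) ** 2          # small for high-leverage points
    gcv_weighted = np.mean((resid / (1 - h)) ** 2 * weights)
    return ocv, gcv, gcv_weighted

rng = np.random.default_rng(8)
x = rng.normal(size=(50, 6))                         # simulated data
y = x @ np.array([1.0, -1.0, 0.5, 0, 0, 0]) + rng.normal(size=50)
ocv, gcv, gcv_weighted = scores(x, y, lam=2.0)
```

Only the trace of the smoother matrix enters the first form of the GCV score, whereas OCV needs every diagonal element.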

A final justification for using GCV, which was emphasized by Golub et al. (1979),
is an invariance property. Namely, GCV is invariant to certain transformations of
the data whereas OCV is not. Suppose we transform $y$ and $x$ to $Qy$ and $Qx$,
respectively, where $Q$ is any $n \times n$ orthogonal matrix (i.e., $Q Q^T = Q^T Q = I_n$).
For fixed $\lambda$, minimization with respect to $\beta$ of

$$(y - x\beta)^T (y - x\beta) + \lambda \beta^T \beta$$

leads to inference that is identical to minimization of

$$(Qy - Qx\beta)^T (Qy - Qx\beta) + \lambda \beta^T \beta.$$

However, for fixed $\lambda$, the OCV scores are not identical, so that $\lambda$ obtained via
minimization of the OCV will differ depending on whether we work with $y$ or $Qy$.
If $S^{(\lambda)}$ is the linear smoother for the original data, then

$$S_Q^{(\lambda)} = Q S^{(\lambda)} Q^T$$

is the linear smoother for the rotated data. Note that

$$\mathrm{tr}\left( S_Q^{(\lambda)} \right) = \mathrm{tr}\left( Q S^{(\lambda)} Q^T \right) = \mathrm{tr}\left( S^{(\lambda)} Q^T Q \right) = \mathrm{tr}\left( S^{(\lambda)} \right),$$

and GCV is invariant to the choice of $Q$ (Golub et al. 1979). It can be shown
(e.g., Wood 2006, Sect. 4.5.3) that GCV corresponds to the rotation of the data
that results in each of the diagonal elements of $S_Q^{(\lambda)}$ being equal. Since the expected
prediction error is invariant to the rotation used, the GCV score shares with the OCV
score the interpretation as an estimate of the expected prediction error.
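The invariance property can be illustrated numerically. In the sketch below a random orthogonal matrix $Q$ is generated from a QR decomposition (an illustrative choice), the data are simulated, and the GCV score is again taken to be $n\,\mathrm{RSS}(\lambda)/[n - \mathrm{tr}(S^{(\lambda)})]^2$; the function name `ocv_gcv` is hypothetical.

```python
import numpy as np

def ocv_gcv(x, y, lam):
    """OCV and GCV scores for ridge regression with penalty lam."""
    n, p = x.shape
    s = x @ np.linalg.solve(x.T @ x + lam * np.eye(p), x.T)
    resid = y - s @ y
    ocv = np.mean((resid / (1 - np.diag(s))) ** 2)
    gcv = n * np.sum(resid ** 2) / (n - np.trace(s)) ** 2
    return ocv, gcv

rng = np.random.default_rng(3)
n, p, lam = 30, 4, 2.0
x = rng.normal(size=(n, p))                      # simulated data
y = x @ np.ones(p) + rng.normal(size=n)
q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # a random orthogonal matrix

ocv0, gcv0 = ocv_gcv(x, y, lam)                  # original data
ocv1, gcv1 = ocv_gcv(q @ x, q @ y, lam)          # rotated data
```

The rotation leaves the residual sum of squares and the trace unchanged, so the GCV scores coincide, while the individual leverages change and the OCV scores generally do not.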

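Using the approximation $(1 - x)^{-2} \approx 1 + 2x$ we obtain

$$\mathrm{GCV}(\lambda) \approx \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \hat{f}^{(\lambda)}(x_i) \right]^2 \left[ 1 + \frac{2\, \mathrm{tr}\left( S^{(\lambda)} \right)}{n} \right] = \frac{\mathrm{RSS}(\lambda)}{n} + \frac{2\, \mathrm{tr}\left( S^{(\lambda)} \right)}{n} \frac{\mathrm{RSS}(\lambda)}{n},$$

which is proportional to Mallows' $C_p$ if we replace $\hat{\sigma}^2_{\max}$ in (10.39) with $\sigma^2$, up to a
constant not depending on $\lambda$.


10.6.4 AIC for General Models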


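The AIC was introduced in Sect. 4.8.2; here we provide a derivation as a generalization
of Mallows' $C_p$. Consider the prediction of new observations $Y_1^*, \dots, Y_n^*$ with
model

$$Y_i^* \mid \beta \sim_{\text{ind}} \mathrm{N}\left[ f_i(\beta), \sigma^2 \right],$$

for $i = 1, \dots, n$. Suppose we fit a model using data $Y_n = [Y_1, \dots, Y_n]$ and obtain
the MLE $\hat{\beta}$. The expected value of the negative maximized log-likelihood evaluated
at $\hat{\beta}$ is

$$-E\left[ l(\hat{\beta}) \right] = \frac{n}{2} \log 2\pi + n \log \sigma + \frac{1}{2\sigma^2} E\left\{ \sum_{i=1}^{n} \left[ Y_i^* - f_i(\hat{\beta}) \right]^2 \right\}.$$

Considering the last term only, we saw in Sect. 10.4.1 that

$$E\left\{ \sum_{i=1}^{n} \left[ Y_i^* - f_i(\hat{\beta}) \right]^2 \right\} = n\sigma^2 + E\left\{ \sum_{i=1}^{n} \left[ f_i(\beta) - f_i(\hat{\beta}) \right]^2 \right\} \qquad (10.47)$$

and Mallows' $C_p$ was derived as an approximation to the second term, with “good”
models having a low $C_p$.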

We now consider a general log-likelihood based on $n$ observations, $l_n(\beta)$, with
our aim being to find a criterion to judge the “fits” of a collection of models, taking
into account model complexity. The basis of AIC is to evaluate a model based on its
ability to predict new data $Y_i^*$, $i = 1, \dots, n$. The prediction is based on the model
$p(y^* \mid \hat{\beta})$ with $\hat{\beta}$ being the MLE based on an independent sample of size $n$, $Y_n$.

The criterion that is used for discrimination, that is, to decide on whether the
prediction is good, is the Kullback-Leibler distance (as discussed in Sect. 2.4.3)
between the true model and the assumed model. The distance between the true
(unknown) distribution $p_T(y^*)$ and a model $p(y^* \mid \beta)$ is

$$\mathrm{KL}\left[ p_T(y^*), p(y^* \mid \beta) \right] = \int \log \left( \frac{p_T(y^*)}{p(y^* \mid \beta)} \right) p_T(y^*) \, dy^* \geq 0.$$

A good model with estimator $\hat{\beta}$ will produce a small value of

$$\mathrm{KL}\left[ p_T(y^*), p(y^* \mid \hat{\beta}) \right]. \qquad (10.48)$$

Unfortunately (10.48) cannot be directly used, since $p_T(y^*)$ is unknown, but we
show how it may be approximated, up to an additive constant.
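To make the Kullback-Leibler distance concrete, the sketch below evaluates it in a setting where the truth is known: a standard normal true density and a normal model with mean `mu` and standard deviation `sd`. These distributions, the integration grid, and the function names are illustrative assumptions; the closed form used for comparison is the standard expression for the distance between two univariate normals.

```python
import numpy as np

def normal_pdf(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def kl_numeric(mu, sd, grid=np.linspace(-10, 10, 20001)):
    """KL[p_T, p(. | mu, sd)] with p_T = N(0, 1), by a Riemann sum on a fine grid."""
    p_true = normal_pdf(grid, 0.0, 1.0)
    p_model = normal_pdf(grid, mu, sd)
    integrand = np.log(p_true / p_model) * p_true
    return np.sum(integrand) * (grid[1] - grid[0])

def kl_closed_form(mu, sd):
    """Closed form for KL[N(0, 1), N(mu, sd^2)]."""
    return np.log(sd) + (1 + mu ** 2) / (2 * sd ** 2) - 0.5

kl_wrong = kl_numeric(1.0, 2.0)    # a poorly matched model
kl_right = kl_numeric(0.0, 1.0)    # model equal to the truth
```

The distance is zero when the model coincides with the truth and grows as the model moves away from it; in practice $p_T$ is unknown, which is what motivates the approximation developed next.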

Result: Let $Y_n = [Y_1, \dots, Y_n]$ be a random sample from $p_T(y)$ and suppose a model
$p(y \mid \beta)$ is fitted to these data and yields MLE $\hat{\beta}$, where $\beta$ is a parameter vector
of dimension $p$. For simplicity, we state and prove the result for independent and
identically distributed data but the result is true in the nonidentically distributed case
also. We wish to predict an independent sample, $Y_i^*$, $i = 1, \dots, n$, using $p(y^* \mid \hat{\beta})$.
Two times the expected distance between the true distribution and the assumed
distribution, evaluated at the estimator $\hat{\beta}$, is

$$D^* = 2 \times E_{Y^*}\left[ \sum_{i=1}^{n} \log \left( \frac{p_T(Y_i^*)}{p(Y_i^* \mid \hat{\beta})} \right) \right] = 2n \times \mathrm{KL}\left[ p_T(y^*), p(y^* \mid \hat{\beta}) \right]. \qquad (10.49)$$

Then, we have the approximation

$$D^* \approx 2n \times \mathrm{KL}\left[ p_T(y^*), p(y^* \mid \beta_T) \right] + p, \qquad (10.50)$$

where $\beta_T$ is the value of $\beta$ that minimizes the Kullback-Leibler distance between
$p_T(y)$ and $p(y \mid \beta)$ (for discussion, see Sect. 2.4.3). The difference between (10.49)
and (10.50) therefore gives the increase in the discrepancy when $p(y^* \mid \beta_T)$ is
replaced by $p(y^* \mid \hat{\beta})$.

An estimate of $D^*$ is

$$\hat{D}^* = -2 \times l_n(\hat{\beta}) + 2p + 2c_T$$

where $c_T = \int \log[p_T(y^*)]\, p_T(y^*) \, dy^*$ is a constant that is common to all models
under comparison. Ignoring this constant gives Akaike's An Information Criterion$^7$
(AIC, Akaike 1973):

$$\mathrm{AIC} = -2 \times l_n(\hat{\beta}) + 2p.$$


Outline  Derivation 


The outline proof presented below is based on Davison (2003, Sect. 4.7). The
distance measure $D^*$ given in (10.49) is two times the expected difference between
log-likelihoods:

$$D^* = E\left[ 2n \log p_T(Y^*) - 2n \log p(Y^* \mid \hat{\beta}) \right], \qquad (10.51)$$

where the expectation is with respect to the true model $p_T(y^*)$. We proceed by first
approximating the second term via a Taylor series expansion about $\beta_T$. Let

$$S_1(\beta) = \frac{\partial}{\partial \beta} \log p(Y \mid \beta), \qquad I_1(\beta) = -E\left[ \frac{\partial^2}{\partial \beta \, \partial \beta^T} \log p(Y \mid \beta) \right]$$

denote the score and information in a sample of size one. Then

$$2n \log p(Y^* \mid \hat{\beta}) \approx 2n \log p(Y^* \mid \beta_T) + 2n (\hat{\beta} - \beta_T)^T S_1(\beta_T) - n (\hat{\beta} - \beta_T)^T I_1(\beta_T) (\hat{\beta} - \beta_T).$$

7 Commonly AIC is referred to as Akaike's Information Criterion.

Note that $E[S_1(\beta_T)] = 0$ and $n (\hat{\beta} - \beta_T)^T I_1(\beta_T) (\hat{\beta} - \beta_T)$ is asymptotically $\chi^2_p$
(Sect. 2.9.4) so its expectation is $p$, the number of elements of $\beta$. Hence, the second
term in (10.51) may be approximated by

$$E\left[ 2n \log p(Y^* \mid \hat{\beta}) \right] \approx E\left[ 2n \log p(Y^* \mid \beta_T) \right] - p. \qquad (10.52)$$

Therefore,

$$D^* \approx 2n \times E\left[ \log p_T(Y^*) - \log p(Y^* \mid \beta_T) \right] + p$$

$$= 2n \int \log \left[ \frac{p_T(y^*)}{p(y^* \mid \beta_T)} \right] p_T(y^*) \, dy^* + p$$

$$= 2n \times \mathrm{KL}\left[ p_T(y^*), p(y^* \mid \beta_T) \right] + p \qquad (10.53)$$

proving (10.50).

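This expression for $D^*$ is not usable because $p_T(\cdot)$ is unknown. An estimator of
$\mathrm{KL}[p_T(y^*), p(y^* \mid \beta_T)]$ can be based on
$E[l_n(\hat{\beta})] = E\left[ \sum_{i=1}^{n} \log p(Y_i \mid \hat{\beta}) \right]$, however.
We write

$$-2 \times E\left[ l_n(\hat{\beta}) \right] = 2 \times E\left[ -l_n(\beta_T) - \left\{ l_n(\hat{\beta}) - l_n(\beta_T) \right\} \right]$$

$$\approx 2n \times E\left[ -\log p(Y^* \mid \beta_T) \right] - p$$

$$= 2n \times E\left[ -\log p(Y^* \mid \beta_T) + \log p_T(Y^*) - \log p_T(Y^*) \right] - p$$

$$= 2n \times \mathrm{KL}\left[ p_T(y^*), p(y^* \mid \beta_T) \right] - 2c_T - p \qquad (10.54)$$

where

$$c_T = \int \log\left[ p_T(y^*) \right] p_T(y^*) \, dy^*,$$

and we have used the asymptotic result that

$$2\left[ l_n(\hat{\beta}) - l_n(\beta_T) \right] \rightarrow_d \chi^2_p, \qquad (10.55)$$

as $n \rightarrow \infty$; see (2.55). It follows, by rearrangement of (10.54), that

$$2n \times \mathrm{KL}\left[ p_T(y^*), p(y^* \mid \beta_T) \right] \approx -2 \times E\left[ l_n(\hat{\beta}) \right] + p + 2c_T,$$

which suggests an estimator of

$$2n \times \widehat{\mathrm{KL}}\left[ p_T(y^*), p(y^* \mid \beta_T) \right] = -2 \times l_n(\hat{\beta}) + p + 2c_T.$$

This estimate can be substituted into (10.53) to give the estimator

$$\hat{D}^* = -2 \times l_n(\hat{\beta}) + 2p + 2c_T = \mathrm{AIC} + 2c_T,$$

where $\mathrm{AIC} = -2 \times l_n(\hat{\beta}) + 2p$. Since the term on the right is common to all models,
the AIC may be used to compare models, with relatively good models producing a
small value of the AIC. Some authors suggest retaining all models whose AIC is
within 2 of the minimum (e.g. Ripley 2004). □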

The  above  derivation  is  based  on  a  number  of  assumptions  (Ripley  2004)  including 
the  model  under  consideration  being  true.  The  accuracy  of  the  approximations  is 
also  much  greater  if  the  models  under  comparison  are  nested. 
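As a small worked illustration of $\mathrm{AIC} = -2 \times l_n(\hat{\beta}) + 2p$, the sketch below compares nested polynomial regressions under a Gaussian likelihood. The simulated data, the function name `gaussian_aic`, and the convention of counting the error variance as one of the $p$ parameters are assumptions made for the example.

```python
import numpy as np

def gaussian_aic(x, y):
    """AIC for a Gaussian linear model with design matrix x (n-by-k)."""
    n, k = x.shape
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    rss = np.sum((y - x @ beta) ** 2)
    sigma2 = rss / n                                   # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)   # maximized log-likelihood
    return -2 * loglik + 2 * (k + 1)                   # k coefficients plus sigma^2

rng = np.random.default_rng(4)
n = 200
t = rng.uniform(-1, 1, size=n)
y = 1 + 2 * t - 3 * t ** 2 + rng.normal(scale=0.3, size=n)  # true curve is quadratic

# Polynomial models of degree 0 to 5; np.vander builds columns t^d, ..., t^0
aic = {d: gaussian_aic(np.vander(t, d + 1), y) for d in range(6)}
best = min(aic, key=aic.get)
near_best = [d for d in aic if aic[d] - aic[best] <= 2]   # "within 2 of the minimum"
```

Underfitted models (degrees 0 and 1) are heavily penalized through the log-likelihood term, while the models retained in `near_best` follow the within-2 convention mentioned above.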

In a GLM smoothing setting, the AIC may be minimized as a function of $\lambda$, with
the degrees of freedom $p$ being replaced by $\mathrm{tr}(S^{(\lambda)})$. The AIC criterion in this case is

$$\mathrm{AIC}(\lambda) = -2\, l(\hat{\beta}^{(\lambda)}) + 2 \times \mathrm{tr}\left( S^{(\lambda)} \right), \qquad (10.56)$$

with the second term again measuring complexity.


An  Aside 


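The derivation of AIC was carried out under the assumption of a correct model,
which was required to obtain (10.52) and (10.55). If the model is wrong, then
$\sqrt{n}(\hat{\beta} - \beta_T)$ is asymptotically normal with zero mean and variance $I_1^{-1} K I_1^{-1}$
where

$$K = K(\beta_T) = E\left[ \left( \frac{\partial}{\partial \beta} \log p(Y \mid \beta_T) \right) \left( \frac{\partial}{\partial \beta} \log p(Y \mid \beta_T) \right)^T \right];$$

see Sect. 2.4.3. Hence, using identity (B.4) from Appendix B, the expectation of
$n (\hat{\beta} - \beta_T)^T I_1(\beta_T) (\hat{\beta} - \beta_T)$ is

$$\mathrm{tr}\left[ I_1(\beta_T) I_1(\beta_T)^{-1} K(\beta_T) I_1(\beta_T)^{-1} \right] = \mathrm{tr}\left[ K(\beta_T) I_1(\beta_T)^{-1} \right].$$

Similarly, under a wrong model, the likelihood ratio statistic
$2[l_n(\hat{\beta}) - l_n(\beta_T)]$ has
an asymptotic distribution proportional to $\chi^2_p$ but with mean
$\mathrm{tr}[K(\beta_T) I_1(\beta_T)^{-1}]$.
This follows since, via a Taylor series approximation,

$$2\left[ l_n(\hat{\beta}) - l_n(\beta_T) \right] \approx n (\hat{\beta} - \beta_T)^T I_1(\beta_T) (\hat{\beta} - \beta_T).$$

Replacing $p$ by $\mathrm{tr}[K(\beta_T) I_1(\beta_T)^{-1}]$ in the above derivation gives the alternative
network information criterion (NIC)

$$\mathrm{NIC} = -2\, l(\hat{\beta}) + 2 \times \mathrm{tr}\left[ K(\hat{\beta}) I_1(\hat{\beta})^{-1} \right]$$

as introduced by Stone (1977).


10.6.5 Cross-Validation for Generalized Linear Models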


As discussed in Sect. 10.3.1, for general outcomes, a loss function for measuring
the accuracy of a prediction is the negative log-likelihood. Hence, cross-validation
can be extended to general data situations by replacing the sum of squares in (10.42)
with a loss function to give

$$\mathrm{CV}_k^{(\lambda)} = \frac{1}{n_k} \sum_{i \in J(k)} L\left[ y_i, \hat{f}_{-k}^{(\lambda)}(x_i) \right].$$

In particular, the negative log-likelihood loss (10.12) produces

$$\mathrm{CV}_k^{(\lambda)} = -\frac{2}{n_k} \sum_{i \in J(k)} \log p_{\hat{f}_{-k}^{(\lambda)}}(y_i \mid x_i),$$

where this notation emphasizes that the prediction at the point $x_i$ is based upon the
fitted value $\hat{f}_{-k}^{(\lambda)}(x_i)$. Similarly, a natural extension of (10.46) is the generalized
cross-validation score based on the log-likelihood

$$\mathrm{GCV}(\lambda) = -\frac{2n}{\left[ n - \mathrm{tr}\left( S^{(\lambda)} \right) \right]^2} \sum_{i=1}^{n} \log p_{\hat{f}^{(\lambda)}}(y_i \mid x_i).$$

Some authors (e.g., Ruppert et al. 2003, p. 220) replace the log-likelihood by the
deviance, which adds a term that does not depend on $\lambda$.
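Cross-validation with a log-likelihood loss can be sketched for a ridge-penalized logistic regression. Everything specific here is an assumption made for illustration: the simulated binary data, the penalty written as $(\lambda/2)\beta^T\beta$, the plain Newton-Raphson fitting routine, the loss taken as minus twice the Bernoulli log-likelihood, and the function names.

```python
import numpy as np

def fit_ridge_logistic(x, y, lam, iters=50):
    """Newton-Raphson for the Bernoulli log-likelihood minus (lam / 2) * beta'beta."""
    beta = np.zeros(x.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-x @ beta))
        grad = x.T @ (y - mu) - lam * beta
        hess = x.T @ (x * (mu * (1 - mu))[:, None]) + lam * np.eye(x.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def neg2_loglik(x, y, beta):
    """Minus twice the Bernoulli log-likelihood contribution of each point."""
    eta = x @ beta
    return -2 * (y * eta - np.logaddexp(0, eta))

def cv_deviance(x, y, lams, K=5, seed=0):
    """K-fold CV score with log-likelihood loss, for each lambda in lams."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv = np.zeros(len(lams))
    for j, lam in enumerate(lams):
        for idx in folds:
            train = np.setdiff1d(np.arange(n), idx)
            beta = fit_ridge_logistic(x[train], y[train], lam)
            cv[j] += np.mean(neg2_loglik(x[idx], y[idx], beta)) / K
    return cv

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=(n, 4))                              # simulated covariates
eta = x @ np.array([1.0, -1.0, 0.5, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))              # simulated binary outcome
lams = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
cv = cv_deviance(x, y, lams)
lam_hat = lams[np.argmin(cv)]
```

A useful reference value is $2 \log 2 \approx 1.39$, the loss obtained by predicting probability 0.5 for every observation; a fitted model that carries signal should score below this.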


Example:  Prostate  Cancer 

We illustrate smoothing/tuning parameter choice and estimation of the prediction
error using various approaches to modeling and a number of the methods described
in Sect. 10.6 for smoothing parameter estimation. The modeling approaches we
compare are fitting the full model using least squares, picking the “best” subset
of variables via an exhaustive search based on Mallows' $C_p$, ridge regression, the
lasso, and Bayesian model averaging (Sect. 3.6). We divide the prostate data into
a training dataset of 67 randomly selected individuals and a test dataset of the
remaining 30 individuals. Since the sample size is small, we repeat this splitting
500 times and then evaluate, for the different methods, the average error and its
standard deviation over the train/test splits. An important point to emphasize is
that we standardize the x variables in the training dataset and then apply the same
standardization in the test dataset (and this procedure is repeated separately for each
of the 500 splits).

Table 10.3 Average test errors over 500 train/test splits of the prostate
cancer data, along with the standard deviation over these splits

        Null   Full   Best subset   Ridge   Lasso   BMA
Mean    1.30   0.59   0.76          0.59    0.60    0.59
SD      0.32   0.15   0.35          0.14    0.14    0.14

Table 10.3 gives summaries of the test error, calculated via (10.18), for the five
approaches. We also report the error that results from fitting the null (intercept only)
model. The latter is a baseline reference, and gives an error of 1.30. The estimate
of error corresponding to the full model fitted with least squares is 0.59, a reduction
of 71%. The exhaustive search over model space (i.e., the $2^8 = 256$ combinations of
8 variables), using Mallows' $C_p$ as the model selection criterion, was significantly
worse giving an error of 0.76 with a large standard deviation. Table 10.4 shows
the variability across train/test splits in the model chosen by the exhaustive search
procedure. For example, 34.2% of models contained only the variables log(can vol),
log(weight), and SVI. The seven most frequently occurring models account for
73.8% of the total, with the remainder being spread over 27 other combinations of
variables. The table illustrates the discreteness of the exhaustive search procedure
(as discussed in Sect. 4.9) and explains the poor prediction performance. Ridge
regression and the lasso were applied to each train/test split with $\lambda$ chosen via
minimization of the OCV score. The entries in Table 10.3 show that, for these data,
the shrinkage methods provide prediction errors which are comparable to, and not
an improvement on, the full model. The reason for this is that in this example the
ratio of the sample size to the number of parameters is relatively large, and so there
is little penalty for including all parameters in the model.
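The repeated train/test protocol used in this example, including the point that the test covariates are standardized with the training means and standard deviations, can be sketched as follows. The data are simulated stand-ins with the same dimensions as the prostate data (97 observations, 8 covariates), not the actual data; the fixed ridge penalty, the smaller number of splits, and the function name `split_errors` are also assumptions.

```python
import numpy as np

def split_errors(x, y, n_train, n_splits=100, lam=1.0, seed=0):
    """Mean and SD of the ridge test error over random train/test splits."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    errors = np.empty(n_splits)
    for s in range(n_splits):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        # Standardize using training-set statistics only, then reuse on the test set
        m, sd = x[tr].mean(axis=0), x[tr].std(axis=0, ddof=1)
        xtr, xte = (x[tr] - m) / sd, (x[te] - m) / sd
        ybar = y[tr].mean()                       # intercept from the training data
        beta = np.linalg.solve(xtr.T @ xtr + lam * np.eye(p),
                               xtr.T @ (y[tr] - ybar))
        errors[s] = np.mean((y[te] - ybar - xte @ beta) ** 2)
    return errors.mean(), errors.std(ddof=1)

rng = np.random.default_rng(7)
n, p = 97, 8
x = rng.normal(loc=5.0, scale=2.0, size=(n, p))   # simulated covariates
y = 1 + x @ np.array([1.0, -0.5, 0.5, 0, 0, 0, 0, 0]) + rng.normal(size=n)
mean_err, sd_err = split_errors(x, y, n_train=67)
```

Recomputing the standardization inside each split keeps information from the test observations out of the fitting step, so the averaged test error remains an honest estimate.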

Figure  10.8  illustrates  the  variability  across  train/test  splits  of  the  optimal 
effective  degrees  of  freedom,  chosen  via  minimization  of  (a)  the  OCV  score  and  (b) 
Mallows' $C_p$, for the ridge regression analyses. The two measures are then plotted
against  each  other  in  (c)  and  show  reasonable  agreement.  There  is  a  reasonable 
amount  of  variability  in  the  optimal  degrees  of  freedom  across  simulations. 

The  final  approach  included  in  this  experiment  was  Bayesian  model  averaging. 
In  this  example,  the  performance  of  BMA  matches  that  of  ridge  regression  and  the 
lasso.  BMA  is  superior  to  exhaustive  search  because  covariates  are  not  excluded 
entirely,  but  rather  every  model  is  assigned  a  posterior  weight  so  that  all  covariates 
contribute  to  the  fit.  A  number  of  successful  approaches  to  prediction,  including 
boosting,  bagging,  and  random  forests  (Hastie  et  al.  2009),  gain  success  from 
averaging  over  models,  since  different  models  can  pick  up  different  aspects  of  the 
data,  and  the  variance  is  reduced  by  averaging.  Bagging  and  random  forests  are 
described  in  Sects.  12.8.5  and  12.8.6,  respectively. 

We now provide more detail on the ridge regression, lasso, and Bayesian model
averaging approaches. We first consider in greater detail the application of ridge
regression. Figure 10.9 shows estimates of the test error, evaluated via different
methods, as a function of the effective degrees of freedom, for a single train/test
split. The minimizing values are indicated as vertical lines. The dotted line shows
the estimate as the AMSE plus the estimate of the error variance, (10.41). The
minimizing value of AMSE (which is equivalent to minimizing Mallows' $C_p$) is
very similar to that obtained with the OCV criterion and is also virtually identical to
that obtained from GCV. In all cases, the curves are flat close to the minimum, so one
would not want to overinterpret specific numerical values. The effective degrees of
freedom corresponding to the minimum OCV is 5.9, while under GCV and Mallows' $C_p$,
the values are identical and equal to 5.7. The fivefold CV estimate is minimized
for a slightly larger value than for OCV for this train/test split (effective degrees
of freedom of 6.6); over all train/test splits, fivefold CV produced a comparable
prediction error to OCV. Also included in the figure is the average residual sum of
squares, which is minimized at the most complex model (degrees of freedom equal
to 8), as expected, and underestimates the predictive error, since the data are being
used twice.

Fig. 10.8 Minimizing values of the effective degrees of freedom for ridge regression from 500
train/test splits of the prostate cancer data using: (a) OCV, (b) Mallows' $C_p$ as the minimizing
criteria. Panel (c) plots the optimal degrees of freedom arising from each criterion against each
other. [Panels (a) and (b): horizontal axis Optimal DF]

Table 10.4 Percentage of models selected in an exhaustive best subset search,
over 500 train/test splits of the prostate cancer data

Variables selected
lcavol  lweight  age  lbph  svi  lcp  gleason  pgg45  Percentage
1       1        0    0     1    0    0        0      34.2
1       1        1    1     1    0    0        0      11.4
1       1        0    1     1    0    0        0      11.0
1       0        0    1     1    0    0        0      5.8
1       0        1    1     1    0    0        0      4.8
1       1        0    0     1    0    0        1      3.4
1       1        1    0     1    0    0        0      3.2

Fig. 10.9 Various estimates of error, as a function of the effective degrees of freedom,
for ridge regression applied to the prostate cancer data. Minimizing values are shown
as vertical lines. Also shown is the residual sum of squares (which has a minimum at 8
degrees of freedom). [Horizontal axis: Degrees of Freedom]

Turning  now  to  the  lasso,  Figs.  10.10(a)  and  (b)  show  the  OCV  and  GCV 
estimates  of  error  versus  the  coefficient  shrinkage  factor,  along  with  estimates  of 
the  standard  error.  As  with  ridge  regression,  the  curves  are  relatively  flat  close  to  the 
minimum,  indicating  that  we  should  not  be  wedded  to  the  exact  minimizing  value  of 
the  smoothing  parameter.  For  this  train/test  split,  the  minimizing  value  of  the  OCV 
function  leads  to  three  coefficients  being  set  to  zero. 

Finally, for Bayesian model averaging, Fig. 10.11 provides a plot in which
the  horizontal  axis  orders  the  models  in  terms  of  decreasing  posterior  probability 
(going  from  left  to  right),  with  the  variables  indicated  on  the  vertical  axis.  Black 
rectangles  denote  inclusion  of  that  variable  and  gray,  no  inclusion.  The  posterior 
model  percentages  for  the  top  five  models  are  23%,  17%,  8%,  7%,  and  6%. 
Fig. 10.10 (a) OCV and (b) fivefold CV estimates of error for the lasso, as a function of the scaling factor, $\sum_{j=1}^{k} |\hat\beta_j| / \sum_{j=1}^{k} |\hat\beta_j^{\mathrm{LS}}|$, for the prostate cancer data. The minimizing value of the CV estimates of error is shown as a solid vertical line. Also shown are approximate standard error bands evaluated as if the CV estimates were independent (as discussed in Sect. 10.6.2)

Fig. 10.11 From left to right this plot shows, for a particular split of the prostate cancer data, the models with the highest posterior probability, as evaluated via Bayesian model averaging. Horizontal axis: model number


10.7  Concluding  Comments 

Whether  parametric  or  nonparametric  models  are  used,  the  bias-variance  trade-off 
is  a  key  consideration.  In  nonparametric  modeling  there  are  explicit  smoothing 
parameters  that  determine  this  trade-off.  We  saw  this  with  both  ridge  regression 
and  the  lasso,  and  this  issue  will  return  repeatedly  in  Chaps.  11  and  12.  The 
choice  of  smoothing  parameter  is,  therefore,  crucial  and  a  variety  of  approaches  for 
selection, including cross-validation and the minimization of Mallows' $C_p$, have been
described. Additional methods will be described in Chap. 11, but no single approach
will work in all situations, and often subjective judgement is required. Härdle et al.
(1988) have shown that smoothing parameter methods such as Mallows' $C_p$ and
GCV  converge  slowly  to  the  optimum  as  the  sample  size  increases.  A  number  of 
simulation  studies  have  been  carried  out  and  back  up  the  above  comments,  see,  for 
example,  Ruppert  et  al.  (2003,  Sect.  5.4)  and  references  therein. 


10.8  Bibliographic  Notes 

There  are  many  excellent  texts  on  nonparametric  regression,  including  Green  and 
Silverman  (1994),  Simonoff  (1997),  Ruppert  et  al.  (2003),  Wood  (2006),  and,  more 
recently  and  with  a  large  range  of  topics,  Hastie  et  al.  (2009).  Gneiting  and  Raftery 
(2007)  provide  an  excellent  review  of  scoring  rules,  which  are  closely  related  to 
the  loss  functions  considered  in  Sect.  10.3.  An  important  early  reference  on  ridge 
regression  is  Hoerl  and  Kennard  (1970).  Since  its  introduction  in  Tibshirani  (1996), 
the lasso has been the subject of much interest, see Tibshirani (2011) and the ensuing
discussion  for  a  summary.  There  is  a  considerable  literature  on  the  theoretical 
aspects  of  the  lasso,  for  example,  examining  its  properties  with  respect  to  prediction 
loss  and  model  selection,  see  Meinshausen  and  Yu  (2009)  and  references  therein. 


10.9  Exercises 

10.1  For  the  LIDAR  data  described  in  Sect.  10.2.1  fit  polynomials  of  increasing 
degree  as  a  function  of  range  and  comment  on  the  fit  to  the  data.  These 
data are available in the R package SemiPar and are named lidar. What
degree  of  polynomial  is  required  to  obtain  an  adequate  fit  to  these  data? 
[Hint:  One  method  of  assessing  the  latter  is  to  examine  residuals.] 

10.2  The  BPD  data  described  in  Sect.  7.2.3  are  available  on  the  book  website.  Fit 
linear  and  quadratic  logistic  regression  models  to  these  data  and  interpret  the 
parameters. 

10.3  Carry  out  backwards  elimination  for  the  prostate  cancer  data,  which  are 
available  in  the  R  package  lasso2  and  are  named  Prostate.  Comment 
on  the  standard  errors  of  the  estimates  in  the  final  model  that  you  arrive  at, 
as  compared  to  the  corresponding  estimates  in  the  full  model. 

10.4  With  reference  to  Sect.  10.3.1: 

a. Show that minimization of expected quadratic loss, $E_{X,Y}\{[Y - f(X)]^2\}$,
leads to $f(x) = E[Y \mid x]$.

b. Show that minimization of expected absolute value loss, $E_{X,Y}[\,|Y - f(X)|\,]$,
leads to $f(x) = \mathrm{median}(Y \mid x)$.

c. Consider the bilinear loss function

$$L[y, f(x)] = \begin{cases} a[y - f(x)] & \text{if } f(x) \le y \\ b[f(x) - y] & \text{if } f(x) \ge y. \end{cases}$$

Deduce that this leads to the optimal $f(x)$ being the $100 \times a/(a + b)\%$
point of the distribution function of $Y$.

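10.5 a. Show that the expected value of scaled quadratic loss

$$E_{Y \mid x}\left\{ \frac{[Y - f(x)]^2}{Y^2} \right\}$$

is minimized by

$$f(x) = \frac{E[Y^{-1} \mid x]}{E[Y^{-2} \mid x]}.$$

b. Suppose $Y \mid \mu(x), \alpha \sim \mathrm{Ga}(\alpha^{-1}, [\mu(x)\alpha]^{-1})$ and that prediction of $Y$
using $f(x)$ is required, under scaled quadratic loss. Show that

$$f(x) = \frac{E[Y^{-1} \mid x]}{E[Y^{-2} \mid x]} = (1 - 2\alpha)\mu(x).$$

[Hint: If $Y \mid a, b \sim \mathrm{Ga}(a, b)$, then $Y^{-1} \mid a, b \sim \mathrm{InvGa}(a, b)$.]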


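10.6 From Sect. 10.5.1 show, using a Lagrange multiplier argument, that minimizing
the penalized sum of squares

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2$$

is equivalent to minimization of

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{k} \beta_j^2 \le s,$$

for some $s$.

10.7 Prove the alternative formulas (10.25)-(10.27) for ridge regression.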

10.8 Show, using (10.45), that

$$|\, y_i - \hat f^{(-i)}(x_i) \,| \ge |\, y_i - \hat f(x_i) \,|.$$

Interpret this result.

10.9 Cross-validation can fail completely for some problems, as will now be
illustrated.

(a) Suppose we smooth a response $y_i$ by minimizing, with respect to $\mu_i$,
$i = 1, \dots, n$, the ridge regression sum of squares

$$\sum_{i=1}^{n} (y_i - \mu_i)^2 + \lambda \sum_{i=1}^{n} \mu_i^2,$$

where $\lambda$ is the smoothing parameter. Show that for this problem, the OCV
and GCV scores are identical and independent of $\lambda$.

(b)  By  considering  the  basic  principle  of  OCV,  explain  what  causes  the 
failure  of  the  previous  part. 

(c) Given the explanation of the failure of cross-validation for the ridge
regression problem in part (a), it might be expected that the following
modified approach will work better. Suppose a covariate $x_i$ is observed
for each $y_i$ (and for convenience, assume $x_i < x_{i+1}$ for all $i$). Define
$\mu(x)$ to be the piecewise linear function with $n - 1$ linear segments
between $x_{i-1}$ and $x_i$ for $i = 2, \dots, n$. In this case $\mu_i$ could be estimated
by minimizing the following penalized least squares objective:

$$\sum_{i=1}^{n} (y_i - \mu_i)^2 + \lambda \int \mu(x)^2\,dx,$$

with respect to $\mu_i$, $i = 1, \dots, n$.

Now consider three equally spaced points $x_1, x_2, x_3$ with corresponding
$\mu$ values $\mu_1, \mu_2, \mu_3$. Suppose that $\mu_1 = \mu_3 = \mu^*$, but that $\mu_2$ can be
freely chosen. Show that in order to minimize $\int_{x_1}^{x_3} \mu(x)^2\,dx$, $\mu_2$ should
be set to $-\mu^*/2$. What does this imply about trying to choose $\lambda$ by
cross-validation?

[Hint: think about what the penalty will do to $\mu_i$ if we "leave out" $y_i$.]

(d) Would the penalty

$$\int \mu'(x)^2\,dx$$

suffer from the same problem as the penalty used in part (c)?

(e)  Would  you  expect  to  encounter  these  sorts  of  problems  with  penalized 
regression  smoothers?  Explain  your  answer. 

10.10  In  this  question  data  in  the  R  package  faraway  that  are  named  meatspec 
will be analyzed. These data concern the fat content, which is the response,
measured  in  215  samples  of  finely  chopped  meat,  along  with  100  covariates 
measuring  the  absorption  at  100  wavelengths.  Perform  ridge  regression  on 
these  data  using  OCV  and  GCV  to  choose  the  smoothing  parameter.  You 
should  include  a  plot  of  how  the  estimates  change  as  a  function  of  the 
smoothing  parameter  and  a  plot  displaying  the  cross-validation  scores  as  a 
function  of  the  smoothing  parameter. 

10.11  For  the  prostate  cancer  data  considered  throughout  this  chapter,  reproduce 
the  summaries  in  Table  10.3,  coding  up  “by  hand”  the  cross-validation 
procedures. 




Chapter  11 

Spline  and  Kernel  Methods 


11.1  Introduction 

Spline  models  are  based  on  piecewise  polynomial  fitting,  while  kernel  regression 
models  are  based  on  local  polynomial  fitting.  These  two  approaches  to  modeling 
are  extremely  popular,  and  so  we  dedicate  a  whole  chapter  to  their  description. 

The  layout  of  this  chapter  is  as  follows.  In  Sect.  11.2,  a  variety  of  approaches 
to  spline  modeling  are  described,  while  Sect.  11.3  discusses  kernel-based  methods. 
For  inference,  an  estimate  of  the  error  variance  is  required;  this  topic  is  discussed 
in Sect. 11.4. In this chapter we concentrate on a single x variable only. However,
we do consider general responses and, in particular, the class of generalized linear
models. Approaches for these types of data are described in Sect. 11.5. Concluding
comments  appear  in  Sect.  11.6.  There  is  an  extensive  literature  on  spline  and 
kernel  modeling;  Sect.  11.7  gives  references  to  key  contributions  and  book-length 
treatments. 

11.2  Spline  Methods 

11.2.1  Piecewise  Polynomials  and  Splines 

For  continuous  responses,  splines  are  simply  linear  models,  with  an  enhanced  basis 
set that provides flexibility.1 Let $h_j(x): \mathbb{R} \rightarrow \mathbb{R}$ denote the $j$th function of $x$, for
$j = 1, \dots, J$. A generic linear model consists of the linear basis expansion in $x$:

$$f(x) = \sum_{j=1}^{J} h_j(x)\beta_j.$$


1  Appendix  C  gives  a  brief  review  of  bases. 


J.  Wakefield,  Bayesian  and  Frequentist  Regression  Methods,  Springer  Series 
in  Statistics,  DOI  10.1 007/978- 1  -44 1 9-0925- 
©  Springer  Science+Business  Media  New  York  2013 
Fig. 11.1 Polynomial fits to the LIDAR data: (a) quadratic, (b) cubic, (c) quartic, and (d) degree-8
polynomial. Horizontal axis: range (m)


An  obvious  choice  of  basis  is  a  polynomial  of  degree  J  —  1,  but  the  global  behavior 
of  such  a  choice  can  be  poor  in  the  sense  that  the  polynomial  will  not  provide  a  good 
fit  over  the  complete  range  of  x.  However,  local  behavior  can  be  well  represented 
by  relatively  low-order  polynomials. 


Example:  Light  Detection  and  Ranging 

Figure  11.1  shows  degree  2,  3,  4,  and  8  polynomial  fits  to  the  LIDAR  data.  The 
quadratic  and  cubic  models  fit  very  badly,  while  the  quartic  model  produces  a  poor 
fit  for  ranges  of  500-560  m.  The  degree-8  polynomial  fit  is  also  not  completely 
satisfactory  with  wiggles  at  the  extremes  of  the  range  variable  due  to  the  global 
nature  of  the  fitting. 

To  motivate  spline  models,  we  fit  piecewise-constant,  linear,  quadratic,  and  cubic 
models  using  least  squares,  with  three  pieces  in  each  case.  The  fits  are  displayed  in 
Fig.  11.2.  We  focus  on  the  piecewise  linear  model,  as  shown  in  Fig.  11.2(b).  By 
forcing  the  curve  to  be  continuous  but  only  allowing  linear  segments,  we  see  that 
Fig. 11.2 Piecewise polynomials for the LIDAR data: (a) constant, (b) linear, (c) quadratic, and
(d) cubic


the  fit  is  not  good  (particularly  in  the  first  segment).  The  lack  of  smoothness  is  also 
undesirable.  The  quadratic  and  cubic  fits  in  panels  (c)  and  (d)  are  far  more  appealing 
visually, though neither provides a satisfactory fit because we have allowed only three
polynomial pieces. In particular, in panel (d), the cubic fit is still poor at the left
endpoint.  □ 

We  now  start  the  description  of  spline  models  by  introducing  some  notation. 
Let $\xi_1 < \xi_2 < \dots < \xi_L$ be a set of ordered points, called knots, contained in
some interval $[a, b]$. An $M$th-order spline is a piecewise polynomial of degree $M - 1$
with $M - 2$ continuous derivatives at the knots.2 Splines are very popular in
nonparametric  modeling  though,  as  we  shall  see,  care  is  required  in  choosing  the 
degree  of  smoothing.  The  latter  depends  on  a  variety  of  factors  including  the  order 
of  the  spline  and  the  number  and  position  of  the  knots. 

We  begin  with  a  discussion  on  the  order  of  the  spline.  The  most  basic  piecewise 
polynomial  is  a  piecewise-constant  function,  which  is  a  first-order  spline.  With  two 
knots, $\xi_1$ and $\xi_2$, one possible set of three basis functions is


2From  the  Oxford  dictionary,  a  spline  is  a  “flexible  wood  or  rubber  strip,  for  example,  used  in 
drawing  large  curves  especially  in  railway  work.” 
$$h_1(x) = I(x < \xi_1), \quad h_2(x) = I(\xi_1 \le x < \xi_2), \quad h_3(x) = I(\xi_2 \le x),$$

where $I(\cdot)$ is the indicator function. Note that there are no continuous derivatives at
the knots; Fig. 11.2(a) clearly shows the undesirability of this aspect.

To  obtain  linear  models  in  each  of  the  intervals,  we  may  introduce  three 
additional  bases 

$$h_{3+j}(x) = h_j(x)\,x, \quad j = 1, 2, 3,$$

to give the model

$$f(x) = I(x < \xi_1)(\beta_1 + \beta_4 x) + I(\xi_1 \le x < \xi_2)(\beta_2 + \beta_5 x) + I(\xi_2 \le x)(\beta_3 + \beta_6 x),$$

which contains six parameters. Lack of continuity is a problem with this model, but
we can impose two constraints to enforce $f(\xi_1^-) = f(\xi_1^+)$ and $f(\xi_2^-) = f(\xi_2^+)$,
which imply the two conditions

$$\beta_1 + \xi_1\beta_4 = \beta_2 + \xi_1\beta_5$$

$$\beta_2 + \xi_2\beta_5 = \beta_3 + \xi_2\beta_6,$$

to give four parameters in total. A neater way of incorporating these constraints is
with the basis set:

$$h_1(x) = 1, \quad h_2(x) = x, \quad h_3(x) = (x - \xi_1)_+, \quad h_4(x) = (x - \xi_2)_+, \qquad (11.1)$$

where $t_+$ denotes the positive part. We refer to the generic basis $(x - \xi)_+$ as a
truncated line.3 The resultant function

$$f(x) = \beta_0 + \beta_1 x + \beta_2 (x - \xi_1)_+ + \beta_3 (x - \xi_2)_+$$

is continuous at the knots since all prior basis functions are contributing to the fit
up to any single x value. The model defined by the basis (11.1) is an order-2 spline,
and the first derivative is discontinuous. Figure 11.3 shows the basis functions for
this representation and Fig. 11.2(b) the fit of this model to the LIDAR data.
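As a rough illustration of the truncated-line basis, the sketch below evaluates a piecewise linear spline with two knots and checks numerically that the function is continuous at each knot while the slope changes by the coefficient attached to that knot. The knot positions and coefficient values are hypothetical and chosen only for illustration; they are not estimates from the LIDAR data.

```python
def positive_part(t):
    """Return t_+ = max(t, 0)."""
    return t if t > 0.0 else 0.0


def linear_spline(x, beta, knots):
    """Evaluate beta[0] + beta[1]*x + sum_l beta[2+l] * (x - knots[l])_+."""
    value = beta[0] + beta[1] * x
    for coef, knot in zip(beta[2:], knots):
        value += coef * positive_part(x - knot)
    return value


# Hypothetical knots and coefficients (illustration only).
knots = [1.0, 2.0]
beta = [0.5, 1.0, -3.0, 2.5]

eps = 1e-6
for coef, knot in zip(beta[2:], knots):
    at_knot = linear_spline(knot, beta, knots)
    left = linear_spline(knot - eps, beta, knots)
    right = linear_spline(knot + eps, beta, knots)
    # One-sided finite-difference slopes just left and right of the knot.
    slope_left = (at_knot - left) / eps
    slope_right = (right - at_knot) / eps
    print(f"knot {knot}: jump in f = {right - left:.1e}, "
          f"change in slope = {slope_right - slope_left:.3f} "
          f"(coefficient {coef})")
```

The jump in the function shrinks with `eps`, whereas the change in slope stays fixed at the truncated-line coefficient, which is the discontinuity in the first derivative noted in the text.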

We  now  consider  how  the  piecewise  linear  model  may  be  extended.  Naively,  we 
might  assume  the  quadratic  form: 

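$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 (x - \xi_1)_+ + \beta_4 (x - \xi_1)_+^2 + \beta_5 (x - \xi_2)_+ + \beta_6 (x - \xi_2)_+^2, \qquad (11.2)$$

which is continuous but has first derivative

$$f'(x) = \beta_1 + 2\beta_2 x + \beta_3 I(x > \xi_1) + 2\beta_4 (x - \xi_1)_+ + \beta_5 I(x > \xi_2) + 2\beta_6 (x - \xi_2)_+,$$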


3  It  is  conventional  to  define  the  truncated  lines  with  respect  to  bases  that  take  the  positive  part,  but 
we  could  have  defined  the  same  model  with  respect  to  bases  taking  the  negative  part. 
Fig. 11.3 Basis functions for a piecewise linear model with two knots at $\xi_1$ and $\xi_2$. The solid lines are the bases 1 and $x$, and the dashed lines are the bases $(x - \xi_1)_+$ and $(x - \xi_2)_+$

which is discontinuous at the knot points $\xi_1$ and $\xi_2$ due to the linear truncated
bases associated with $\beta_3$ and $\beta_5$ in (11.2). This lack of smoothness at the knots is
undesirable. Hence, we drop the truncated linear bases to give the regression model

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 (x - \xi_1)_+^2 + \beta_4 (x - \xi_2)_+^2,$$

which has continuous first derivative:

$$f'(x) = \beta_1 + 2\beta_2 x + 2\beta_3 (x - \xi_1)_+ + 2\beta_4 (x - \xi_2)_+.$$


The  second  derivative  is  discontinuous,  however,  which  may  also  be  undesirable. 
Consequently,  a  popular  form  (which  we  justify  more  rigorously  shortly)  is  a  cubic 
spline.  We  will  concentrate  on  cubic  splines  in  some  detail,  and  so  we  introduce  a 
slight  change  of  notation  with  respect  to  the  truncated  cubic  parameters.  With  two 
knots  the  function  and  first  three  derivatives  are 


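$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + b_1 (x - \xi_1)_+^3 + b_2 (x - \xi_2)_+^3$$

$$f'(x) = \beta_1 + 2\beta_2 x + 3\beta_3 x^2 + 3 b_1 (x - \xi_1)_+^2 + 3 b_2 (x - \xi_2)_+^2$$

$$f''(x) = 2\beta_2 + 6\beta_3 x + 6 b_1 (x - \xi_1)_+ + 6 b_2 (x - \xi_2)_+$$

$$f'''(x) = 6\beta_3 + 6 b_1 I(x > \xi_1) + 6 b_2 I(x > \xi_2).$$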

The  latter  is  discontinuous,  with  a  jump  at  the  knots.  Figure  11.4  shows  the  basis 
functions  for  the  cubic  spline,  with  two  knots,  and  Fig.  11.2(d)  shows  the  fit  to  the 
LIDAR  data. 

For $L$ knots, we write the cubic spline function as

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{l=1}^{L} b_l (x - \xi_l)_+^3, \qquad (11.3)$$
Fig. 11.4 Basis functions for a piecewise cubic spline model with two knots at $\xi_1$ and $\xi_2$. Panel
(a) shows the bases $1$, $x$, $x^2$, and $x^3$ and panel (b) the bases $(x - \xi_1)_+^3$ and $(x - \xi_2)_+^3$. Note that
in (b) the bases have been scaled in the vertical direction for clarity


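so that we have $L + 4$ coefficients. The key to implementation is to recognize that
we simply have a linear model, $f(x) = E[Y \mid z] = z\gamma$, where $z = z(x)$ and

$$z = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & (x_1 - \xi_1)_+^3 & \cdots & (x_1 - \xi_L)_+^3 \\ 1 & x_2 & x_2^2 & x_2^3 & (x_2 - \xi_1)_+^3 & \cdots & (x_2 - \xi_L)_+^3 \\ \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & x_n^3 & (x_n - \xi_1)_+^3 & \cdots & (x_n - \xi_L)_+^3 \end{bmatrix}, \qquad \gamma = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ b_1 \\ \vdots \\ b_L \end{bmatrix}.$$

The obvious estimator is therefore $\hat\gamma = (z^T z)^{-1} z^T Y$, which gives the linear
smoother $\hat Y = SY$, where $S = z(z^T z)^{-1} z^T$.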
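To make the linear-model view concrete, the following minimal sketch builds the rows $z(x)$ of the truncated cubic basis and solves the normal equations for $\hat\gamma$. It uses only the Python standard library; the helper `solve`, the knots, the coefficients, and the noise-free data are all hypothetical and serve only to check that a known cubic spline is recovered.

```python
def cubic_spline_row(x, knots):
    """Row z(x) = [1, x, x^2, x^3, (x-xi_1)_+^3, ..., (x-xi_L)_+^3]."""
    return [1.0, x, x**2, x**3] + [max(x - k, 0.0) ** 3 for k in knots]


def solve(a, b):
    """Solve the linear system a g = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    g = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * g[c] for c in range(r + 1, n))
        g[r] = (m[r][n] - tail) / m[r][r]
    return g


def fit_cubic_spline(xs, ys, knots):
    """Least squares estimate gamma_hat = (z'z)^{-1} z'Y."""
    z = [cubic_spline_row(x, knots) for x in xs]
    p = len(z[0])
    ztz = [[sum(row[j] * row[k] for row in z) for k in range(p)]
           for j in range(p)]
    zty = [sum(row[j] * y for row, y in zip(z, ys)) for j in range(p)]
    return solve(ztz, zty)


# Hypothetical noise-free data generated from a known cubic spline.
knots = [-0.3, 0.3]
true_gamma = [1.0, -2.0, 0.5, 3.0, -8.0, 12.0]
xs = [-1.0 + i / 20 for i in range(41)]
ys = [sum(g * zj for g, zj in zip(true_gamma, cubic_spline_row(x, knots)))
      for x in xs]

gamma_hat = fit_cubic_spline(xs, ys, knots)
print([round(g, 6) for g in gamma_hat])
```

Solving the normal equations directly is adequate for a small illustration such as this, but the truncated power basis can be poorly conditioned, which is one motivation for the B-spline basis of Sect. 11.2.4.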


11.2.2  Natural  Cubic  Splines 

Spline models such as (11.3) can produce erratic behavior beyond the extreme knots.
A natural spline enforces linearity beyond the boundary knots, that is,

$$f(x) = a_1 + a_2 x \quad \text{for } x \le \xi_1$$

$$f(x) = a_3 + a_4 x \quad \text{for } x \ge \xi_L.$$

The first condition only considers values of $x$ before the knots, and therefore, the $b_l$
parameters in (11.3) are irrelevant. Consequently, it is straightforward to see that we
require

$$\beta_2 = \beta_3 = 0. \qquad (11.4)$$

For $x \ge \xi_L$,

$$f(x) = \beta_0 + \beta_1 x + \sum_{l=1}^{L} b_l (x - \xi_l)^3 = \beta_0 + \beta_1 x + \sum_{l=1}^{L} b_l (x^3 - 3x^2\xi_l + 3x\xi_l^2 - \xi_l^3),$$

and so, for linearity,

$$\sum_{l=1}^{L} b_l = \sum_{l=1}^{L} b_l \xi_l = 0. \qquad (11.5)$$

Hence, we have four additional constraints in total, so that the basis for a natural
cubic spline has $L$ elements. Exercise 11.3 describes an alternative basis.
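The effect of the four constraints can be checked numerically. The sketch below evaluates a cubic spline in truncated power form with $\beta_2 = \beta_3 = 0$ and with $b_l$ values satisfying the two sum constraints, and confirms that beyond the last knot the function coincides with a straight line. The knots and coefficients are hypothetical values picked so that the constraints hold exactly.

```python
def cubic_spline(x, beta, b, knots):
    """Evaluate beta_0 + beta_1 x + beta_2 x^2 + beta_3 x^3 + sum_l b_l (x - xi_l)_+^3."""
    poly = sum(coef * x**power for power, coef in enumerate(beta))
    return poly + sum(bl * max(x - k, 0.0) ** 3 for bl, k in zip(b, knots))


# Hypothetical knots and coefficients (illustration only).
knots = [1.0, 2.0, 4.0]
beta = [1.0, 0.5, 0.0, 0.0]   # beta_2 = beta_3 = 0
b = [2.0, -3.0, 1.0]          # sum b_l = 0 and sum b_l * xi_l = 0

constraint_sum = sum(b)
constraint_weighted = sum(bl * k for bl, k in zip(b, knots))

# Expanding the cubes for x beyond the last knot, the x^3 and x^2 terms
# cancel, leaving a line with the following slope and intercept.
tail_slope = beta[1] + 3.0 * sum(bl * k**2 for bl, k in zip(b, knots))
tail_intercept = beta[0] - sum(bl * k**3 for bl, k in zip(b, knots))

for x in (5.0, 6.5, 10.0):
    print(x, cubic_spline(x, beta, b, knots), tail_intercept + tail_slope * x)
```

Changing any single $b_l$ so that either sum is nonzero reintroduces a quadratic or cubic term beyond the last knot.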


11.2.3  Cubic  Smoothing  Splines 

So  far  we  have  examined  splines  in  a  heuristic  way,  as  flexible  functions  with  certain 
desirable  properties  in  terms  of  the  continuity  of  the  function  and  the  first  and  second 
derivatives  at  the  knots.  We  now  present  a  formal  justification  for  the  natural  cubic 
spline. 

Result. Consider the penalized least squares criterion

$$\sum_{i=1}^{n} [y_i - f(x_i)]^2 + \lambda \int f''(x)^2\,dx, \qquad (11.6)$$

where the second term penalizes the roughness of the curve and $\lambda$ controls the
degree of this roughness. It is clear that without the penalization term, we could
choose an infinite number of curves that interpolate the data (in the case of unique x
values, at least), with arbitrary behavior in between. Quite remarkably, the $f(\cdot)$ that
minimizes (11.6) is the natural cubic spline with knots at the unique data points; we
call this function $g(x)$.

Proof. The proof has two parts and is based on Green and Silverman (1994,
Chap. 2). We begin by showing that a natural cubic spline minimizes (11.6) amongst
all interpolating functions and then extend to non-interpolating functions. Assume
that $x_1 < \dots < x_n$. We consider all functions that are continuous in $[x_1, x_n]$ with
continuous first and second derivatives and which interpolate $[x_i, y_i]$, $i = 1, \dots, n$.
Since the first term of (11.6) is zero, we need to show that the natural cubic spline,
$g(x)$, minimizes

$$\int_{x_1}^{x_n} f''(x)^2\,dx.$$

Let $\tilde g(x)$ be another interpolant of $(x_i, y_i)$, and define $h(x) = \tilde g(x) - g(x)$. Then,

$$\int_{x_1}^{x_n} \tilde g''(x)^2\,dx = \int_{x_1}^{x_n} [g''(x) + h''(x)]^2\,dx = \int_{x_1}^{x_n} [g''(x)^2 + 2g''(x)h''(x) + h''(x)^2]\,dx.$$

Applying integration by parts to the cross term,

$$\int_{x_1}^{x_n} g''(x)h''(x)\,dx = [g''(x)h'(x)]_{x_1}^{x_n} - \int_{x_1}^{x_n} g'''(x)h'(x)\,dx$$

$$= -\int_{x_1}^{x_n} g'''(x)h'(x)\,dx \quad \text{since } g''(x_1) = g''(x_n) = 0$$

$$= -\sum_{i=1}^{n-1} g'''(x_i^+) \int_{x_i}^{x_{i+1}} h'(x)\,dx$$

since $g'''(x)$ is constant in, and $x_i^+$ is a point in, $[x_i, x_{i+1}]$

$$= -\sum_{i=1}^{n-1} g'''(x_i^+)\,[h(x_{i+1}) - h(x_i)]$$

$$= 0$$

since $h(x_{i+1}) = \tilde g(x_{i+1}) - g(x_{i+1})$ and both are interpolants (and similarly for
$h(x_i)$). We have shown that

$$\int_{x_1}^{x_n} \tilde g''(x)^2\,dx = \int_{x_1}^{x_n} g''(x)^2\,dx + \int_{x_1}^{x_n} h''(x)^2\,dx \ge \int_{x_1}^{x_n} g''(x)^2\,dx$$

with equality if and only if $h''(x) = 0$ for $x_1 < x < x_n$. The latter implies
$h(x) = a + bx$, but $h(x_1) = h(x_n) = 0$, and so $a = b = 0$. Consequently, any
interpolant that is not identical to $g(x)$ will have a higher integrated squared second
derivative. Therefore, the natural cubic spline with knots at the unique x values is
the smoothest interpolant in the sense of minimizing $\int f''(x)^2\,dx$. This is of use in,
for example, numerical analysis, where interpolation of $[x_i, y_i]$ is of interest. But,
in statistical applications, the data are measured with error, and we typically do not
wish to restrict attention to interpolating functions.4


4There  are  some  analogies  here  with  bias,  variance,  and  mean  squared  error.  The  penalized  sum  of 
squares  (11.6)  is  analogous  to  the  mean  squared  error,  and  interpolating  functions  are  “unbiased” 
We have shown that a natural cubic spline minimizes (11.6) amongst all
interpolating functions, but the minimizing function need not necessarily be an
interpolant since an interpolating function may have a large associated penalty
contribution. The second part of the proof considers functions that do not necessarily
interpolate the data but have $n$ free parameters $g(x_i)$, with the aim being minimization
of (11.6). The resulting $g(x)$ is known as a smoothing spline. Suppose some
function $f^*(x)$, other than the cubic smoothing spline, minimizes (11.6). Let $g(x)$
be the natural cubic spline that interpolates $[x_i, f^*(x_i)]$, $i = 1, \dots, n$. Obviously,
$f^*$ and $g$ produce the same residual sum of squares in (11.6) since $f^*(x_i) = g(x_i)$.
But, by the first part of the proof,

$$\int f^{*\prime\prime}(x)^2\,dx \ge \int g''(x)^2\,dx.$$

Hence, the natural cubic spline is the function that minimizes (11.6); this spline is
known as a cubic smoothing spline.

The above result has shown us that if we wish to minimize (11.6), we should take
as model class the cubic smoothing splines. The coefficient estimates of the fit will
depend on the value chosen for $\lambda$. We stress that the fitted natural cubic smoothing
spline will not typically interpolate the data, and the level of smoothness will be
determined by the value of $\lambda$ chosen. Small values of $\lambda$, which correspond to a large
effective degrees of freedom (Sect. 10.5.1), impose little smoothness and bring the
fit closer to interpolation, while large values will result in the fit being close to linear
in x (in the limit, a zero second derivative is required).

In terms of interpretation, if a thin piece of flexible wood (a mechanical spline) is
placed over the points $[x_i, y_i]$, $i = 1, \dots, n$, then the position taken up by the piece
of wood will be of minimum energy and will describe a curve that approximately
minimizes $\int f''(x)^2\,dx$ over curves that interpolate the data.


Example:  Light  Detection  and  Ranging 

We  fit  a  natural  cubic  spline  to  the  LIDAR  data.  Figure  11.5  shows  the  ordinary 
and  generalized  cross-validation  scores  (as  described  in  Sects.  10.6.2  and  10.6.3, 
respectively)  versus  the  effective  degrees  of  freedom.  The  curves  are  very  similar 
with  well-defined  minima  since  these  data  are  abundant  and  the  noise  level  is 
relatively  low.  The  OCV  and  GCV  scores  are  minimized  at  9.3  and  9.4  effective 
degrees  of  freedom,  respectively.  Figure  11.6  shows  the  fit  (using  the  GCV 
minimum corresponding to $\lambda = 959$), which appears good. In particular, we note
that  the  boundary  behavior  is  reasonable. 


but  may  have  large  variability.  However,  we  can  obtain  a  better  estimator  if  we  are  prepared  to 
examine  “biased”  (i.e.,  non-interpolating)  functions. 
Fig. 11.5 Ordinary and generalized cross-validation scores versus effective degrees of freedom for the LIDAR data and a natural cubic spline model

Fig. 11.6 Cubic spline fits to the LIDAR data. The natural cubic spline fit has smoothing parameter chosen by generalized cross-validation. The mixed model cubic spline has smoothing parameter chosen by REML


11.2.4 B-Splines

There are many ways of choosing a basis to represent a cubic spline; the so-called
B-spline basis functions are popular, a primary reason being that they are nonzero
over a limited range, which aids in computation. B-splines also form the building
blocks for other spline models, as we describe in Sect. 11.2.5. The classic text on
B-splines is de Boor (1978).

B-splines are available for splines of general order, which we again denote by
$M$ (so that for a cubic spline, $M = 4$). The number of basis functions is $L + M$
since we have an $M - 1$ degree polynomial (giving $M$ bases) and one basis for each
knot. The original set of knots are denoted $\xi_l$, $l = 1, \dots, L$, and we let $\xi_0 < \xi_1$ and
$\xi_L < \xi_{L+1}$ represent two boundary knots. We define an augmented set of knots, $\tau_j$,
$j = 1, \dots, L + 2M$, with

$$\tau_1 \le \tau_2 \le \dots \le \tau_M \le \xi_0$$

$$\tau_{j+M} = \xi_j, \quad j = 1, \dots, L$$

$$\xi_{L+1} \le \tau_{L+M+1} \le \tau_{L+M+2} \le \dots \le \tau_{L+2M},$$

where the choice of the additional knots is arbitrary and so we may, for example,
set $\tau_1 = \dots = \tau_M = \xi_0$ and $\xi_{L+1} = \tau_{L+M+1} = \dots = \tau_{L+2M}$. These additional
knots ensure the basis functions detailed below are defined close to the boundaries.
To construct the bases, first define

$$B_j^1(x) = \begin{cases} 1 & \text{if } \tau_j \le x < \tau_{j+1} \\ 0 & \text{otherwise} \end{cases} \qquad (11.7)$$

for $j = 1, \dots, L + 2M - 1$. For $1 < m \le M$, define

$$B_j^m(x) = \frac{x - \tau_j}{\tau_{j+m-1} - \tau_j} B_j^{m-1}(x) + \frac{\tau_{j+m} - x}{\tau_{j+m} - \tau_{j+1}} B_{j+1}^{m-1}(x) \qquad (11.8)$$

for $j = 1, \dots, L + 2M - m$. If we divide by zero, then we define the relevant basis
element to be zero. The B-spline bases are nonzero over a domain spanned by at
most $M + 1$ knots. For example, the support of cubic B-splines ($M = 4$) is at most
five knots. At any $x$, $M$ of the B-splines are nonzero.

The cubic B-spline model is

$$f(x) = \sum_{j=1}^{L+4} B_j^4(x)\beta_j. \qquad (11.9)$$

For further details on computation, see Hastie et al. (2009, p. 186). Figure 11.7
shows the cubic B-spline basis (including the intercept) for $L = 9$ knots.
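The recursion can be sketched in a few lines. The code below is a minimal, unoptimized illustration that evaluates all order-$M$ B-splines at a single point, treating any term with a zero denominator as zero. It assumes the common choice of repeating each boundary knot $M$ times; the function names, the interval $[0, 1]$, and the evaluation point are hypothetical.

```python
def bspline_basis(x, tau, order):
    """Evaluate all B-splines of the given order at x by the two-term recursion.

    `tau` is the augmented knot sequence as a 0-based list. Terms whose
    denominator is zero (repeated knots) are defined to be zero.
    """
    # Order 1: indicator functions of the half-open intervals [tau_j, tau_{j+1}).
    basis = [1.0 if tau[j] <= x < tau[j + 1] else 0.0
             for j in range(len(tau) - 1)]
    for m in range(2, order + 1):
        new_basis = []
        for j in range(len(tau) - m):
            left_den = tau[j + m - 1] - tau[j]
            right_den = tau[j + m] - tau[j + 1]
            left = (x - tau[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = ((tau[j + m] - x) / right_den * basis[j + 1]
                     if right_den > 0 else 0.0)
            new_basis.append(left + right)
        basis = new_basis
    return basis


def augmented_knots(interior, lower, upper, order):
    """Repeat each boundary knot `order` times around the interior knots."""
    return [lower] * order + list(interior) + [upper] * order


M = 4                                          # cubic spline
interior = [0.1 * l for l in range(1, 10)]     # L = 9 equally spaced knots
tau = augmented_knots(interior, 0.0, 1.0, M)

values = bspline_basis(0.37, tau, M)
print(len(values), sum(values), sum(1 for v in values if v > 0.0))
```

Because the order-1 indicators are defined on half-open intervals, this sketch returns all zeros at the upper boundary knot itself; practical implementations usually treat that endpoint as a special case.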


11.2.5  Penalized  Regression  Splines 

Although the result of Sect. 11.2.3 is of theoretical interest, in general, we would
like to have a functional form that has fewer parameters than data points. Regression
splines are defined with respect to a reduced set of $L < n$ knots. Automatically
deciding on the number and location of knots is difficult. For example, starting
with $n$ knots and then selecting via stepwise methods (Sect. 4.8.1) is fraught
with difficulties since there are $2^n$ models to choose from (assuming the intercept
and linear terms are always present). An alternative penalized regression spline
approach, with $L < n$ knots, is to choose sufficient knots for flexibility and then to
penalize the parameters associated with the knot bases. If this approach is followed,
the number and selection of knots is far less important than the choice of smoothing
parameter. An obvious choice is to place an $L_2$ penalty on the coefficients, that is, to
include the term $\lambda \sum_{l=1}^{L} b_l^2$ in a penalized least squares form. So-called low-rank
smoothers use considerably fewer than $n$ basis functions.

Fig. 11.7 B-spline basis functions corresponding to a cubic spline ($M = 4$) with $L = 9$ equally spaced knots (whose positions are shown as open circles on the x-axis). There are $L + M = 13$ bases in total. Note that six distinct line types are used so that, for example, there are three splines represented by solid curves: the leftmost, the central, and the rightmost

We now consider linear smoothers of the form:

$$f(x) = \sum_{j=1}^{J} h_j(x)\beta_j = h(x)\beta,$$

where $h(x)$ is a $1 \times J$ vector. A general penalized regression spline is $\hat f(x) = h(x)\hat\beta$,
where $\hat\beta$ is the minimizer of

$$\sum_{i=1}^{n} (y_i - h_i\beta)^2 + \lambda \beta^T D \beta, \qquad (11.10)$$

with $h_i = h(x_i)$, $D$ a symmetric positive semi-definite matrix, and $\lambda > 0$ a
scalar. If we let $h = [h_1, \dots, h_n]^T$ represent the $n \times J$ design matrix, then

$$\hat\beta = (h^T h + \lambda D)^{-1} h^T Y. \qquad (11.11)$$
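The closed form for the penalized estimate is easy to try on a small problem. The sketch below is a minimal standard-library illustration under stated assumptions: a truncated-line basis with three knots, a penalty matrix $D$ that penalizes only the knot coefficients (an $L_2$ penalty), and noise-free data from a function with a single kink. The helper `solve`, the data, and the two values of $\lambda$ are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a g = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    g = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * g[c] for c in range(r + 1, n))
        g[r] = (m[r][n] - tail) / m[r][r]
    return g


def basis_row(x, knots):
    """Row h(x) = [1, x, (x - xi_1)_+, ..., (x - xi_L)_+]."""
    return [1.0, x] + [max(x - k, 0.0) for k in knots]


def penalized_fit(h, y, D, lam):
    """Return beta_hat = (h'h + lam * D)^{-1} h'y."""
    J = len(h[0])
    lhs = [[sum(r[j] * r[k] for r in h) + lam * D[j][k] for k in range(J)]
           for j in range(J)]
    rhs = [sum(r[j] * yi for r, yi in zip(h, y)) for j in range(J)]
    return solve(lhs, rhs)


# Hypothetical data: y = |x - 0.5| on an equally spaced grid.
knots = [0.25, 0.5, 0.75]
xs = [i / 20 for i in range(21)]
ys = [abs(x - 0.5) for x in xs]
h = [basis_row(x, knots) for x in xs]

# Penalize only the truncated-line coefficients (positions 2, 3, 4).
D = [[1.0 if (j == k and j >= 2) else 0.0 for k in range(5)] for j in range(5)]

beta_unpenalized = penalized_fit(h, ys, D, 0.0)
beta_heavy = penalized_fit(h, ys, D, 1e6)
print([round(v, 4) for v in beta_unpenalized])
print([round(v, 4) for v in beta_heavy])
```

With $\lambda = 0$ the kink at 0.5 is recovered exactly, while a very large $\lambda$ shrinks the knot coefficients towards zero so that the fit approaches the least squares line.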


The penalty

$$\lambda \int f''(x)^2\,dx \qquad (11.12)$$

is of the form (11.10) since, for a linear smoother $f(x)$,

$$\int f''(x)^2\,dx = \beta^T \left[ \int h''(x)^T h''(x)\,dx \right] \beta = \beta^T D \beta,$$

with $D$ a matrix of known coefficients. The penalty is measuring complexity: for
$\lambda = 0$, there is no cost to fitting a very complex function, while $\lambda = \infty$ gives the
simple linear least squares line.

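O'Sullivan splines (O'Sullivan 1986) use the cubic B-spline basis representation
(11.9), combined with the penalty (11.12), which takes the form:

$$\lambda \int \Big[ \sum_{j=1}^{L+4} B_j^4(x)''\beta_j \Big]^2 dx = \lambda \sum_{j=1}^{L+4} \sum_{k=1}^{L+4} \beta_j \beta_k \int B_j^4(x)'' B_k^4(x)''\,dx.$$

Hence, the penalty matrix $D$ has $(j, k)$th element $\int B_j^4(x)'' B_k^4(x)''\,dx$. O'Sullivan
splines correspond to cubic smoothing splines for $L = n$ and distinct $x_i$ (Green and
Silverman 1994, Sect. 3.6).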

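The construction of P-splines is based on a different penalty in which a set of
B-spline basis functions are used with a collection of equally spaced knots (Eilers
and Marx 1996). The form of the penalty is

$$\lambda \sum_{j=k+1}^{J} (\Delta^k \beta_j)^2, \qquad (11.13)$$

with $\Delta \beta_j = \beta_j - \beta_{j-1}$, the difference operator, and where $k$ is a positive integer.
For $k = 1$, the penalty is

$$\lambda \sum_{j=1}^{J-1} (\beta_{j+1} - \beta_j)^2 = \lambda \left( \beta_1^2 - 2\beta_1\beta_2 + 2\beta_2^2 + \dots + 2\beta_{J-1}^2 - 2\beta_{J-1}\beta_J + \beta_J^2 \right),$$

which corresponds to the general penalty with

$$D = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & -1 & 2 & -1 \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}.$$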

This  form  penalizes  large  changes  in  adjacent  coefficients,  providing  an  alternative 
representation  of  smoothing.  The  P-spline  approach  was  heavily  influenced  by  the 
derivation  of  O’Sullivan  splines  (O’Sullivan  1986),  and  the  P-spline  penalty  is  an 
approximation  to  the  integrated  squared  derivative  penalty.  See  Eilers  and  Marx 
(1996)  for  a  careful  discussion  of  the  two  approaches.  Wand  and  Ormerod  (2008) 
also  contrast  O’Sullivan  splines  (which  they  refer  to  as  O-splines)  with  P-splines 
and  argue  that  O-splines  are  an  attractive  option  for  nonparametric  regression. 
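As an illustration of the difference penalty, the sketch below builds $D = \Delta^T \Delta$ for an arbitrary difference order in plain Python; the function names are hypothetical. For first-order differences the quadratic form $\beta^T D \beta$ equals the sum of squared adjacent differences, and $D$ is tridiagonal.

```python
def difference_matrix(J, k=1):
    """Rows apply the k-th order difference operator to a length-J vector."""
    rows = [[1.0 if i == j else 0.0 for j in range(J)] for i in range(J)]  # identity
    for _ in range(k):
        # Subtracting adjacent rows raises the difference order by one.
        rows = [[b - a for a, b in zip(rows[i], rows[i + 1])]
                for i in range(len(rows) - 1)]
    return rows

def penalty_matrix(J, k=1):
    """D = Delta' Delta, so beta' D beta is the sum of squared k-th differences."""
    delta = difference_matrix(J, k)
    return [[sum(r[a] * r[b] for r in delta) for b in range(J)] for a in range(J)]

def quadratic_form(beta, D):
    """Evaluate beta' D beta."""
    return sum(beta[a] * D[a][b] * beta[b]
               for a in range(len(beta)) for b in range(len(beta)))

D1 = penalty_matrix(4, k=1)
for row in D1:
    print(row)
```

Higher orders (for example `k=2`) are obtained from the same functions and penalize departures from local linearity rather than from local constancy.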
With respect to penalized regression splines, a number of suggestions exist for
the number and location of the knots. For example, Ruppert et al. (2003) take as
default choice:

$$L = \min\left( \tfrac{1}{4} \times \text{number of unique } x_i, \; 35 \right),$$

with knots taken at the $(l + 1)/(L + 2)$th points of the unique $x_i$. These authors
say that these choices "work well in most of the examples we come across" but urge
against the unquestioning use of these rules.
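A minimal sketch of this default rule follows. It assumes the count is one quarter of the number of unique covariate values, rounded down, capped at 35, and it uses a simple linear-interpolation quantile; both the rounding and the quantile definition are assumptions made for illustration, and `default_knots` is a hypothetical name.

```python
def default_knots(x, max_knots=35):
    """Default number and location of knots following the rule quoted above."""
    ux = sorted(set(x))
    L = min(len(ux) // 4, max_knots)      # floor of n_unique / 4 (an assumption)
    knots = []
    for l in range(1, L + 1):
        p = (l + 1) / (L + 2)             # target quantile level
        pos = p * (len(ux) - 1)           # linear-interpolation quantile position
        lo = int(pos)
        hi = min(lo + 1, len(ux) - 1)
        knots.append(ux[lo] + (pos - lo) * (ux[hi] - ux[lo]))
    return knots

x = [i / 10 for i in range(41)] + [0.5, 0.5, 1.0]   # 41 unique values plus repeats
knots = default_knots(x)
print(len(knots), knots[:3])
```

As the quoted authors stress, such defaults are a starting point and should be checked against the data at hand.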


11.2.6  A  Brief  Spline  Summary 

The  terminology  associated  with  splines  can  be  confusing,  so  we  provide  a  brief 
summary.  For  simplicity,  we  assume  that  the  covariate  x  is  univariate  and  that 
$x_1, \dots, x_n$ are unique. A smoothing spline contains $n$ knots, and a cubic smoothing
spline  is  piecewise  cubic.  A  natural  spline  is  linear  beyond  the  boundary  knots.  If 
there  are  L  <  n  knots,  we  have  a  regression  spline.  A  penalized  regression  spline 
imposes  a  penalty  on  the  coefficients  associated  with  the  piecewise  polynomial.  The 
penalty  terms  may  take  a  variety  of  forms. 

The  number  of  basis  functions  that  define  the  spline  depends  on  the  number  of 
knots  and  the  degree  of  the  polynomial;  natural  splines  have  a  reduced  number  of 
bases.  Spline  models  may  be  parameterized  in  many  different  ways. 


11.2.7  Inference  for  Linear  Smoothers 

Nonparametric  regression  may  be  used  for  a  variety  of  purposes.  The  simplest 
use  is  as  a  scatterplot  smoother  for  pure  exploration.  In  such  a  context,  a  plot  of 
f(x)  versus  x  is  perhaps  all  that  is  required.  In  other  instances,  we  may  wish  to 
produce  interval  estimates,  either  pointwise  or  simultaneous,  in  order  to  examine 
the  uncertainty  as  a  function  of  x. 

We consider linear smoothers with $J$ basis functions and write $\hat f(x) = h(x)\hat\beta$
for a prediction at $x$ with $\hat\beta$ a $J \times 1$ vector and $h(x)$ the $1 \times J$ design matrix
associated with $x$. Further, assume $Y(x) = f(x) + \epsilon(x)$, with the error terms $\epsilon(x)$
uncorrelated and with constant variance $\sigma^2$. We emphasize that $J$ is not equal to
the effective degrees of freedom, which is given by $p(\lambda) = \mathrm{tr}[h(h^T h + \lambda D)^{-1} h^T]$
where $h = [h(x_1), \dots, h(x_n)]^T$. Differentiation of (11.10) with respect to $\beta$ and
setting equal to zero gives

$$\hat\beta = (h^T h + \lambda D)^{-1} h^T Y.$$

Assuming a fixed $\lambda$, asymptotic inference for $\beta$ is straightforward since

$$\left[ (h^T h + \lambda D)^{-1} h^T h (h^T h + \lambda D)^{-1} \right]^{-1/2} (\hat\beta - \beta) \to_d N_J(0, \sigma^2 I).$$

In a nonparametric regression context, interest often focuses on inference for the
underlying function; we first consider inference at a single point $x$, $f(x)$.

Since the estimator is linear in the data,

$$\hat f(x) = h(x)\hat\beta = S(x)Y = \sum_{i=1}^{n} S_i(x) Y_i, \qquad (11.14)$$

where $S(x) = h(x)(h^T h + \lambda D)^{-1} h^T$ is the $1 \times n$ vector with elements $S_i(x)$,
$i = 1, \dots, n$. This estimator has mean

$$E[\hat f(x)] = \sum_{i=1}^{n} S_i(x) f(x_i)$$

and variance

$$\mathrm{var}(\hat f(x)) = \sigma^2 \sum_{i=1}^{n} S_i(x)^2 = \sigma^2 \|S(x)\|^2. \qquad (11.15)$$

A major difficulty with (11.14) is that there will be bias $b(x)$ present in the
estimator. If this bias were known, then

$$\frac{\hat f(x) - f(x) - b(x)}{\sigma \|S(x)\|} \to_d N(0, 1), \qquad (11.16)$$

via a central limit theorem. Note that it is "local" sample size that is relevant here,
with a precise definition depending on the smoothing technique used (which defines
$S(x)$). Estimation of the bias is difficult since it involves estimation of $f''(x)$ (for a
derivation in the context of density estimation, see Sect. 11.3.4).

Often the bias is just ignored. The interpretation of the resultant confidence
intervals is that they are confidence intervals for $\bar f(x) = E[\hat f(x)]$, which may
be thought of as a smoothed version of $f(x)$. We have

$$\frac{\hat f(x) - f(x)}{\sigma \|S(x)\|} = \frac{\hat f(x) - \bar f(x)}{\sigma \|S(x)\|} + \frac{\bar f(x) - f(x)}{\sigma \|S(x)\|} = Z_n(x) + \frac{b(x)}{\sigma \|S(x)\|}, \qquad (11.17)$$

which is a restatement of (11.16) and where $Z_n(x)$ converges to a standard
normal. Hence, a $100(1 - \alpha)\%$ asymptotic confidence interval for $\bar f(x)$ is $\hat f(x) \pm c_\alpha \sigma \|S(x)\|$,
where $c_\alpha$ is the appropriate cutoff point of a standard normal distribution.
In parametric inference, the bias is usually much smaller than the standard
deviation of the estimator, so the bias term goes to zero as the sample size
increases.5 In a smoothing context, we have repeatedly seen that optimal smoothing
corresponds to balancing bias and variance, and the second term does not disappear
from (11.17), even for large sample sizes (recall that $S(x)$ will depend on $\lambda$, whose
choice will depend on sample size).
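The pointwise interval can be sketched in a few lines. The code below assumes NumPy is available, treats $\sigma$ as known, and uses a made-up truncated-linear basis with an identity penalty on the knot coefficients; none of these choices come from the text. It computes the weight vector $S(x)$, the effective degrees of freedom, and the interval $\hat f(x) \pm c\,\sigma\|S(x)\|$, which, as discussed above, targets the smoothed function $E[\hat f(x)]$ rather than $f(x)$ itself.

```python
import numpy as np
from statistics import NormalDist

def smoother_row(hx, h, D, lam):
    """S(x) = h(x) (h'h + lam*D)^{-1} h', the 1 x n weight vector in (11.14)."""
    return hx @ np.linalg.solve(h.T @ h + lam * D, h.T)

def pointwise_interval(hx, h, y, D, lam, sigma, level=0.95):
    """Return (lower, fit, upper) for f_hat(x) +/- c * sigma * ||S(x)||."""
    S = smoother_row(hx, h, D, lam)
    fhat = float(S @ y)
    se = sigma * float(np.linalg.norm(S))
    c = NormalDist().inv_cdf(0.5 + level / 2)
    return fhat - c * se, fhat, fhat + c * se

# Hypothetical data, basis, and smoothing parameter.
x = np.linspace(0.0, 1.0, 80)
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
knots = np.linspace(0.1, 0.9, 9)

def basis(t):
    """Row vector h(t): intercept, slope, truncated-linear terms."""
    return np.concatenate(([1.0, t], np.maximum(t - knots, 0.0)))

h = np.vstack([basis(t) for t in x])
D = np.diag([0.0, 0.0] + [1.0] * knots.size)
lam = 0.5

S_full = h @ np.linalg.solve(h.T @ h + lam * D, h.T)
edf = float(np.trace(S_full))                      # effective degrees of freedom
lower, fit, upper = pointwise_interval(basis(0.25), h, y, D, lam, sigma=0.3)
print(round(edf, 2), round(lower, 3), round(fit, 3), round(upper, 3))
```

In practice $\sigma$ would be replaced by an estimate and $\lambda$ chosen from the data, so the interval width here is only indicative.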

We now turn to simultaneous confidence bands of the function $f(x)$ over an
interval $x \in [a, b]$ with $a = \min(x_i)$ and $b = \max(x_i)$, $i = 1, \dots, n$. In
the following, we will assume that the confidence bands are for the smoothed
function $\bar f(x) = E[\hat f(x)]$, thus sidestepping the bias issue. We again assume linear
smoothers so that (11.14) holds.

One way to think about a simultaneous confidence band is to begin with a finite
grid of $x$ values: $x_j = a + j(b - a)/m$, $j = 1, \dots, m$. Now suppose we wish
to obtain a simultaneous confidence band for $\bar f(x_j)$, $j = 1, \dots, m$. One way of
approaching this problem is to consider the probability that each of the $m$ estimated
functions simultaneously lie within $c$ standard errors of $\bar f$, that is,

$$\Pr\left( \bigcap_{j=1}^{m} \left\{ \left| \frac{\hat f(x_j) - \bar f(x_j)}{\sigma \|S(x_j)\|} \right| \le c \right\} \right),$$

where $c$ is chosen to correspond to the required $1 - \alpha$ level of the confidence
statement. Then

$$\Pr\left( \bigcap_{j=1}^{m} \left\{ \left| \frac{\hat f(x_j) - \bar f(x_j)}{\sigma \|S(x_j)\|} \right| \le c \right\} \right) = \Pr\left( \max_{j=1,\dots,m} \left| \frac{\hat f(x_j) - \bar f(x_j)}{\sigma \|S(x_j)\|} \right| \le c \right). \qquad (11.18)$$

Now suppose that $m \to \infty$ to give the limiting expression for (11.18) as

$$\Pr\left( \sup_{x \in [a,b]} \left| \frac{\hat f(x) - \bar f(x)}{\sigma \|S(x)\|} \right| \le c \right) = \Pr(M \le c).$$


Sun and Loader (1994), following Knafl et al. (1985), considered approximating this
probability in the present context. Let $T(x) = S(x)/\|S(x)\|$. Based on the theory
of Gaussian processes,

$$\Pr(M \ge c) \approx 2[1 - \Phi(c)] + \frac{\kappa_0}{\pi} \exp(-c^2/2),$$

where

$$\kappa_0 = \int_a^b \|T'(x)\| \, dx,$$

$T'(x) = [T_1'(x), \dots, T_n'(x)]^T$ and $T_i'(x) = dT_i(x)/dx$ for $i = 1, \dots, n$. We choose
$c$ to solve

$$\alpha = 2[1 - \Phi(c)] + \frac{\kappa_0}{\pi} \exp(-c^2/2), \qquad (11.19)$$

and $\kappa_0$ may be evaluated using numerical integration over a grid of $x$ values. To
summarize, once an $\alpha$ level is chosen, we obtain $\kappa_0$ and $c$ and then form bands
$\hat f(x) \pm c\,\sigma \|S(x)\|$.

5 With parametric models, we are often interested in simple models with a fixed number of
parameters, even if we know they are not "true". For example, when we carry out linear regression,
we do not usually believe that the "true" underlying function is linear; rather, we simply wish to
estimate the linear association.

In the case of nonconstant variance, we replace $\sigma$ by $\sigma(x)$. Section 11.4 contains
details on estimation of the error variance. Throughout this section, we have
conditioned upon a $\lambda$ value, which is usually estimated from the data. Hence, in
practice, the uncertainty in $\lambda$ is not accounted for in the construction of interval
estimates. A Bayesian mixed model approach (Sect. 11.2.9) treats $\lambda$ as a parameter,
assigns a prior, and then averages over the uncertainty in $\lambda$ in subsequent inference.
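The cutoff $c$ defined by (11.19) has no closed form but is simple to find numerically, since the right-hand side decreases in $c$ for $c \ge 0$. The sketch below uses bisection with only the Python standard library; the value of $\kappa_0$ passed in is a made-up illustration, and in an application it would come from numerical integration of $\|T'(x)\|$ as described above.

```python
from math import exp, pi
from statistics import NormalDist

def tail_approx(c, kappa0):
    """Right-hand side of (11.19): approximate Pr(M > c)."""
    return 2.0 * (1.0 - NormalDist().cdf(c)) + (kappa0 / pi) * exp(-c * c / 2.0)

def simultaneous_cutoff(alpha, kappa0, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection for c such that tail_approx(c, kappa0) = alpha."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail_approx(mid, kappa0) > alpha:
            lo = mid          # tail probability still too large: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_point = NormalDist().inv_cdf(0.975)             # pointwise 95% cutoff
c_band = simultaneous_cutoff(0.05, kappa0=10.0)   # kappa0 = 10 is a made-up value
print(round(c_point, 3), round(c_band, 3))
```

Setting `kappa0=0` recovers the pointwise cutoff, which makes explicit that the second term in (11.19) is the price paid for simultaneity.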

In  some  contexts,  interest  may  focus  on  testing  the  adequacy  of  a  parametric 
model,  comparing  nested  smoothing  models,  or  testing  whether  the  relationship 
between  the  expected  response  and  x  is  flat.  In  each  of  these  cases,  likelihood 
ratio  or  F  tests  can  be  performed  (see,  e.g.,  Wood  2006,  Sect.  4.8.5),  though  the 
nonstandard  context  suggests  that  the  significance  of  test  statistics  should  be  judged 
via  simulation. 


Example:  Light  Detection  and  Ranging 

We fit a cubic penalized regression spline, with penalization $\lambda \sum_l b_l^2$ and $\lambda$ estimated
using generalized cross-validation. Figure 11.8(a) gives pointwise confidence
intervals and simultaneous confidence bands under the assumption of constant
variance. Figure 11.8(b) presents the more appropriate intervals with allowance
for nonconstant variance (for details on how $\sigma(x)$ is estimated, see the example
at the end of Sect. 11.4). The coverage probability is 0.95, and the critical value
for $c$ is 1.96 for the pointwise intervals and 3.11 for the simultaneous intervals, as
calculated from (11.19), with $\kappa_0$ estimated as 15.4. Under a nonconstant variance
assumption, the intervals are very tight for low ranges and increase in width as the
range increases.

Fig. 11.8 Pointwise confidence intervals and simultaneous confidence bands for $f(x)$ for the LIDAR
data under the assumption of (a) homoscedastic errors and (b) heteroscedastic errors. Horizontal
axes: Range (m), 400 to 700

11.2.8  Linear  Mixed  Model  Spline  Representation:  Likelihood
Inference

In this section we describe an alternative mixed model framework for the
representation of regression spline models. A benefit of this framework is that the
smoothing parameter may be estimated using standard inference (e.g., likelihood
or Bayesian) techniques. It is also possible to build complex mixed models that can
model dependencies within the data using random effects, in addition to performing
the required smoothing. In the following, we lean heavily on the material on linear
random effects modeling contained in Chap. 8. Consider the $(p + 1)$th-order (degree
$p$ polynomial) penalized regression spline with $L$ knots, that is,

$$f(x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p + \sum_{l=1}^{L} b_l (x - \xi_l)_+^p.$$

A penalized least squares approach with $L_2$ penalization of the $L$ truncated cubic
coefficients leads to minimization of
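$$\sum_{i=1}^{n} (y_i - \mathbf{x}_i\beta - \mathbf{z}_i b)^2 + \lambda \sum_{l=1}^{L} b_l^2, \qquad (11.20)$$

where

$$\mathbf{x}_i = [1, x_i, \dots, x_i^p], \quad \beta = [\beta_0, \beta_1, \dots, \beta_p]^T, \quad \mathbf{z}_i = [(x_i - \xi_1)_+^p, \dots, (x_i - \xi_L)_+^p], \quad b = [b_1, \dots, b_L]^T.$$

Let $D = \mathrm{diag}(0_{p+1}, 1_L)$ and $c$ be the $n \times (p + 1 + L)$ matrix with $i$th row
$c_i = [1, x_i, \dots, x_i^p, (x_i - \xi_1)_+^p, \dots, (x_i - \xi_L)_+^p]$, so that $c = [\mathbf{x}, \mathbf{z}]$, where
$\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T$ and $\mathbf{z} = [\mathbf{z}_1, \dots, \mathbf{z}_n]^T$. The penalized sum of squares (11.20)
can be written as

$$(y - c\gamma)^T (y - c\gamma) + \lambda \gamma^T D \gamma, \qquad (11.21)$$

where $\gamma = [\beta, b]^T$.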

We now reframe this approach in mixed model form with mean model

$$y_i = f(x_i) + \epsilon_i = \mathbf{x}_i\beta + \mathbf{z}_i b + \epsilon_i,$$

and covariance structure and distributional form determined by $\epsilon_i \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2)$
and $b_l \mid \sigma_b^2 \sim_{iid} N(0, \sigma_b^2)$ with $\epsilon_i$ and $b_l$ independent, $i = 1, \dots, n$,
$l = 1, \dots, L$. This formulation sheds some light on the nature of the penalization.
Since the distribution of $b_l$ is independent of $b_{l'}$ for $l \ne l'$, we are assuming that
the size of the contribution due to the $l$th basis is not influenced by any other
contributions, in particular, the closest (in terms of $x$) basis. For example, knowing
the sign of $b_{l-1}$ does not imply we believe that $b_l$ is of the same sign. This is in
contrast to the P-spline difference penalty described in Sect. 11.2.5.

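Minimization of (11.21) with respect to $\beta$ and $b$ is then equivalent to minimization
of

$$\frac{1}{\sigma_\epsilon^2} (y - \mathbf{x}\beta - \mathbf{z}b)^T (y - \mathbf{x}\beta - \mathbf{z}b) + \frac{1}{\sigma_b^2} b^T b,$$

so that $\lambda = \sigma_\epsilon^2 / \sigma_b^2$. We summarize likelihood-based inference for this linear
mixed model; Sect. 8.5 contains background details. The maximum likelihood (ML)
estimate of $\beta$ is

$$\hat\beta = (\mathbf{x}^T V^{-1} \mathbf{x})^{-1} \mathbf{x}^T V^{-1} Y, \qquad (11.22)$$

where $V = \sigma_b^2 \mathbf{z}\mathbf{z}^T + \sigma_\epsilon^2 I_n$, and the best linear unbiased predictor (BLUP)
estimator/predictor of $b$ is

$$\hat b = \sigma_b^2 \mathbf{z}^T V^{-1} (y - \mathbf{x}\hat\beta). \qquad (11.23)$$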



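Let $\hat\sigma_\epsilon^2$ and $\hat\sigma_b^2$ be the restricted maximum likelihood (REML) estimators (see
Sect. 8.5.3) of $\sigma_\epsilon^2$ and $\sigma_b^2$ so that

$$\hat V = \hat\sigma_b^2 \mathbf{z}\mathbf{z}^T + \hat\sigma_\epsilon^2 I_n.$$

In practice, we use

$$\hat\beta = (\mathbf{x}^T \hat V^{-1} \mathbf{x})^{-1} \mathbf{x}^T \hat V^{-1} Y, \qquad \hat b = \hat\sigma_b^2 \mathbf{z}^T \hat V^{-1} (y - \mathbf{x}\hat\beta).$$

The (penalized) estimator of $\gamma = [\beta, b]^T$ can be written as

$$\hat\gamma = (c^T c + \lambda D)^{-1} c^T Y \qquad (11.24)$$

(Exercise 11.2). Hence, we can write the fitted values as the linear smoother:

$$\hat f = c\hat\gamma = S^{(\lambda)} Y = c(c^T c + \lambda D)^{-1} c^T Y.$$

The degrees of freedom of the model is defined as

$$\mathrm{df}(\lambda) = \mathrm{tr}(S^{(\lambda)}) = \mathrm{tr}[c(c^T c + \lambda D)^{-1} c^T]. \qquad (11.25)$$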
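The equivalence between the mixed model estimates (11.22) and (11.23) and the penalized estimator (11.24) with $\lambda = \sigma_\epsilon^2/\sigma_b^2$ can be checked numerically. The following sketch assumes NumPy is available and treats the variance components as known; the simulated data, knot positions, and variance values are made up for illustration. The arrays `X`, `Z`, and `C` play the roles of the fixed effects, random effects, and combined design matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, L = 50, 3, 8
t = np.linspace(0.0, 1.0, n)
knots = np.linspace(0.1, 0.9, L)
X = np.column_stack([t**d for d in range(p + 1)])                # 1, t, t^2, t^3
Z = np.column_stack([np.maximum(t - k, 0.0)**p for k in knots])  # truncated cubics
y = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.2, n)

sig2_eps, sig2_b = 0.04, 0.5     # variance components treated as known here
lam = sig2_eps / sig2_b

# Mixed model route: (11.22) and (11.23).
V = sig2_b * Z @ Z.T + sig2_eps * np.eye(n)
Vinv_X = np.linalg.solve(V, X)
beta = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)
b = sig2_b * Z.T @ np.linalg.solve(V, y - X @ beta)

# Penalized route: (11.24).
C = np.column_stack([X, Z])
D = np.diag([0.0] * (p + 1) + [1.0] * L)
gamma = np.linalg.solve(C.T @ C + lam * D, C.T @ y)

# Effective degrees of freedom, (11.25).
df = float(np.trace(C @ np.linalg.solve(C.T @ C + lam * D, C.T)))
print(round(df, 2), float(np.max(np.abs(C @ gamma - (X @ beta + Z @ b)))))
```

The fitted values are compared rather than the coefficients because the truncated power basis is poorly conditioned, so individual coefficients are more sensitive to rounding than the fit itself.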

We consider inference for a particular value $x$:

$$\hat f(x) = \mathbf{x}(x)\hat\beta + \mathbf{z}(x)\hat b = c(x)\hat\gamma = c(x)(c^T c + \lambda D)^{-1} c^T Y,$$

where $\mathbf{x}(x) = [1, x, \dots, x^p]$, $\mathbf{z}(x) = [(x - \xi_1)_+^p, \dots, (x - \xi_L)_+^p]$ and $c(x) =
[\mathbf{x}(x), \mathbf{z}(x)]$.

The variance, conditional on $b$, is

$$\mathrm{var}(\hat f(x) \mid b) = \sigma_\epsilon^2 \, c(x)(c^T c + \lambda D)^{-1} c^T c \, (c^T c + \lambda D)^{-1} c(x)^T,$$

which is identical to the variance obtained from ridge regression (10.30). Ruppert
et al. (2003, Sect. 6.4) argue for conditioning on $b$ to give the appropriate measure of
variability. Specifically, they state (in the notation used here): "Randomness of $b$ is
a device used to model curvature, while $\epsilon$ accounts for variability about the curve."
Asymptotic 95% pointwise confidence intervals for $f(x)$ are

$$\hat f(x) \pm 1.96 \times \sqrt{\widehat{\mathrm{var}}(\hat f(x) \mid b)}.$$

Approximate or fully Bayesian approaches to confidence interval construction for
the complete curve have been recently advocated and have been shown to be accurate in
simulation studies; see Chap. 17 of Ruppert et al. (2003) and the detailed account
of Marra and Wood (2012). These accounts build upon the work of Wahba (1983),
Silverman (1985), and Nychka (1988). The latter showed, for univariate $x$, that a
Bayesian  interval  estimate  of  the  curve,  constructed  using  a  cubic  smoothing  spline, 
has  good  frequentist  coverage  probabilities  when  the  bias  in  curve  estimation  is 
a  small  contributor  to  the  overall  mean  squared  error.  In  this  case,  the  average 
posterior  variance  is  a  good  approximation  to  the  mean  squared  error  of  the 
collection  of  predictions.  Marra  and  Wood  (2012)  provide  a  far-ranging  discussion 
of  Bayesian  confidence  interval  construction,  in  the  context  of  generalized  additive 
models,  as  described  in  Sect.  12.2;  included  is  a  discussion  of  when  the  coverage 
probability  of  the  interval  is  likely  to  be  poor,  one  instance  being  when  a  relatively 
large  amount  of  bias  occurs,  for  example,  when  one  over-smooths. 

Tests  of  the  adequacy  of  a  parametric  model  or  of  a  null  association  via  likelihood 
ratio  and  F  tests  are  described  in  Ruppert  et  al.  (2003,  Sects.  6.6  and  6.7).  We 
illustrate  confidence  interval  construction  with  an  example. 


Example:  Light  Detection  and  Ranging 

We fit a cubic spline with 20 equally spaced knots (so that we have 4 fixed effects
and 20 random effects) with REML estimation of the smoothing parameter. The
resultant fit is shown in Fig. 11.6 as a dashed line. The variance components
are estimated as $\hat\sigma_\epsilon^2 = 0.079^2$ and $\hat\sigma_b^2 = 0.012^2$, to give smoothing parameter
$\hat\lambda = 45.8$, which equates to an effective degrees of freedom of 8.5. This is quite
similar to the effective degrees of freedom of 9.4 that was chosen by GCV for
the natural cubic spline fit, which is also shown in Fig. 11.6. The fits are virtually
indistinguishable, which is reassuring. Again we point out that this analysis ignores
the clear heteroscedasticity in these data. Within the linear mixed model framework,
it would be natural to assume a parametric or nonparametric model for $\sigma_\epsilon^2$ as a
function of $x$.

In Fig. 11.9, we display the contributions $\hat b_l (x - \xi_l)_+^3$ from the $l = 1, \dots, 20$
truncated cubic segments. The contribution from the fixed effect cubic, $\hat\beta_0 + \hat\beta_1 x +
\hat\beta_2 x^2 + \hat\beta_3 x^3$, is shown as the solid line in each of the plots in this figure. The 1st
and 16th-20th cubic segments offer virtually no contribution to the fit, while the
contribution of the 4th-14th segments is considerable, which reflects the strong rate
of change in the response between ranges of 550 m and 650 m.


11.2.9  Linear  Mixed  Model  Spline  Representation:  Bayesian 
Inference 

We  now  discuss  a  Bayesian  mixed  model  approach.  The  model  is  the  same  as  in 
the  last  section,  with  carefully  chosen  priors.  We  will  not  discuss  implementation  in 
detail,  but  lean  on  the  INLA  method  described  in  Sect.  3.7.4. 
Fig. 11.9 Contributions of the 20 spline bases to the linear mixed model fit to the LIDAR data. The
cubic fixed effects fitted line is drawn as the solid line on each plot, and the 20 contributions from
each of the truncated cubic segments are drawn as dotted lines on each plot. The dotted vertical
line on each plot indicates the knot location associated with the truncated line segment displayed
in that plot

Prior distributions on smoothing parameters have the potential to increase the
stability of the fit, if the priors are carefully specified. An approach suggested by
Fong et al. (2010) is to place a prior on $\sigma_b^2$ and examine the induced prior on the
effective degrees of freedom, a more easily interpretable quantity. The idea is to
experiment with prior choices on $\sigma_b^2$ until one settles on a prior on the effective
degrees of freedom that one is comfortable with. The effective degrees of freedom
is given by (11.25) and can be rewritten as

$$\mathrm{df}(\lambda) = \mathrm{tr}[(c^T c + \lambda D)^{-1} c^T c].$$

The total degrees of freedom can be decomposed into the degrees of freedom
associated with $\beta$ and $b$. This decomposition can be extended easily to situations
in which we have additional random effects beyond those associated with the
spline basis. In each of these situations, the degrees of freedom associated with the
respective parameter are obtained by summing the appropriate diagonal elements of
$(c^T c + \lambda D)^{-1} c^T c$. Specifically, for $d$ sets of parameters, let $E_j$ be the $(p + 1 + L) \times
(p + 1 + L)$ diagonal matrix with ones in the diagonal positions corresponding to
set $j$, $j = 1, \dots, d$. Then, the degrees of freedom associated with this set are

$$\mathrm{df}_j(\lambda) = \mathrm{tr}[E_j (c^T c + \lambda D)^{-1} c^T c].$$

Note that the effective degrees of freedom change as a function of $L$, as expected.
To evaluate $\lambda$, $\sigma_\epsilon^2$ is required; Fong et al. (2010) recommend the substitution of an
estimate of $\sigma_\epsilon^2$. For example, one may use an estimate obtained from the fitting of a
spline model in a likelihood implementation. For further discussion of prior choice
for $\sigma_b^2$ in a spline context, see Crainiceanu et al. (2005). We first illustrate the steps
in prior construction in a toy example, before presenting a more complex example.


Example:  One-Way  ANOVA  Model

As a simple non-spline demonstration of the derived effective degrees of freedom,
consider the one-way ANOVA model:

$$y_{ij} = \beta_0 + b_i + \epsilon_{ij},$$

with $b_i \mid \sigma_b^2 \sim_{iid} N(0, \sigma_b^2)$ and $\epsilon_{ij} \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2)$ for $i = 1, \dots, m$ groups and
$j = 1, \dots, n$ observations per group. This model may be written as $y = c\gamma + \epsilon$,
where $c$ is the $nm \times (m + 1)$ design matrix

$$c = \begin{bmatrix} 1_n & 1_n & 0_n & \cdots & 0_n \\ 1_n & 0_n & 1_n & \cdots & 0_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1_n & 0_n & 0_n & \cdots & 1_n \end{bmatrix},$$

and $\gamma = [\beta_0, b_1, \dots, b_m]^T$. The effective degrees of freedom are given by (11.25),
with $\lambda = \sigma_\epsilon^2 / \sigma_b^2$ and $D$ a diagonal matrix with a single zero followed by $m$ ones.

For illustration, assume $m = 10$ and $\sigma_b^{-2} \sim \mathrm{Ga}(0.5, 0.005)$. Figure 11.10
displays the prior distribution for $\sigma_b$, the implied prior distribution on the effective
degrees of freedom, and the bivariate plot of these quantities. For clarity, values of
$\sigma_b$ greater than 2.5 (corresponding to 4% of points) are excluded from the plots.
In panel (c), we have placed horizontal lines at effective degrees of freedom equal
to 1 (complete smoothing) and 10 (no smoothing). We also highlight the strong
nonlinearity. From panel (b), we conclude that this prior choice favors quite strong
smoothing.

Fig. 11.10 Gamma prior for $\sigma_b^{-2}$ with parameters 0.5 and 0.005, for the one-way ANOVA
example, (a) Implied prior for $\sigma_b$, (b) implied prior for the effective degrees of freedom, and
(c) effective degrees of freedom versus $\sigma_b$
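The induced prior on the effective degrees of freedom is straightforward to simulate. For the balanced one-way design, direct calculation of the trace in (11.25) gives $\mathrm{df}(\lambda) = 1 + (m - 1)n/(n + \lambda)$, which runs from 1 (as $\lambda \to \infty$) to $m$ (as $\lambda \to 0$). The sketch below uses only the Python standard library; the number of observations per group and the residual standard deviation are not given in the text, so the values used here are assumptions for illustration.

```python
import random

def anova_edf(sigma_b, sigma_eps, m, n):
    """Effective df for the balanced one-way random effects model."""
    lam = sigma_eps**2 / sigma_b**2
    return 1.0 + (m - 1) * n / (n + lam)

random.seed(1)
m, n, sigma_eps = 10, 5, 1.0        # n and sigma_eps are assumed values
draws = []
for _ in range(20000):
    # gammavariate takes (shape, scale); a rate of 0.005 is a scale of 200.
    precision = random.gammavariate(0.5, 1.0 / 0.005)
    sigma_b = precision ** -0.5
    draws.append((sigma_b, anova_edf(sigma_b, sigma_eps, m, n)))

frac_above = sum(s > 2.5 for s, _ in draws) / len(draws)
median_edf = sorted(e for _, e in draws)[len(draws) // 2]
print(round(frac_above, 3), round(median_edf, 2))
```

Repeating the simulation for alternative gamma parameters is one way to carry out the experimentation with prior choices described above.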


Example:  Spinal  Bone  Marrow  Density 

We demonstrate the use of the mixed model for nonparametric smoothing using
O'Sullivan splines, which, as described in Sect. 11.2.5, are based on a B-spline
basis, and using data introduced in Sect. 1.3.6. Recall that these data concern
longitudinal measurements of spinal bone mineral density (SBMD) on 230 female
subjects aged between 8 and 27 years and of one of four ethnic groups: Asian, Black,
Hispanic, and White. Let $y_{ij}$ denote the SBMD measure for subject $i$ at occasion $j$,
for $i = 1, \dots, m = 230$ and $j = 1, \dots, n_i$ and with $n_i$ ranging between 1 and 4.
Let $N = \sum_{i=1}^{m} n_i$. Figure 11.11 shows these data with joined points indicating
measurements on the same woman. For these data, we would like a model in which
the response is a smooth function of age and in which between-woman variability
in response is acknowledged. We therefore assume the model:

Fig. 11.11 Spinal bone mineral density measurements versus age by ethnicity. Measurements on
the same woman are joined with gray lines. The bold solid curve corresponds to the fitted spline,
and the dashed lines to the individual fits

$$y_{ij} = \mathbf{x}_{ij}\beta_1 + \mathrm{age}_{ij}\beta_2 + \sum_{l=1}^{L} z_{ijl} b_{1l} + b_{2i} + \epsilon_{ij},$$

where $\mathbf{x}_{ij}$ is a $1 \times 4$ vector containing an indicator for the ethnicity of individual $i$,
with $\beta_1$ the associated $4 \times 1$ vector of fixed effects, $z_{ijl}$ is the $l$th basis associated
with age, with associated parameters $b_{1l} \mid \sigma_1^2 \sim N(0, \sigma_1^2)$, and $b_{2i} \mid \sigma_2^2 \sim N(0, \sigma_2^2)$
are the woman-specific random effects, and $\epsilon_{ij} \mid \sigma_\epsilon^2 \sim_{iid} N(0, \sigma_\epsilon^2)$ represent the
residual errors. All random terms are assumed independent. Note that the spline
model is assumed common to all ethnic groups and all women, though it would be
straightforward to allow, for example, a different spline for each ethnicity. Let $\beta =
[\beta_1, \beta_2]^T$ and $\mathbf{x}_i$ be the $n_i \times 5$ fixed effect design matrix with $j$-th row $[\mathbf{x}_{ij}, \mathrm{age}_{ij}]$,
$j = 1, \dots, n_i$ (each row is identical since $\mathrm{age}_{ij}$ is the initial age). Also, let $\mathbf{z}_{1i}$ be the
$n_i \times L$ matrix of age basis functions, $b_1 = [b_{11}, \dots, b_{1L}]^T$ be the vector of associated
coefficients, $\mathbf{z}_{2i}$ represent the $n_i \times 1$ vector of ones, and $\epsilon_i = [\epsilon_{i1}, \dots, \epsilon_{in_i}]^T$. Then

$$y_i = \mathbf{x}_i\beta + \mathbf{z}_{1i} b_1 + \mathbf{z}_{2i} b_{2i} + \epsilon_i$$

and we may write:

$$y = \mathbf{x}\beta + \mathbf{z}_1 b_1 + \mathbf{z}_2 b_2 + \epsilon = c\gamma + \epsilon,$$

where $y = [y_1, \dots, y_m]^T$, $\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_m]^T$, $\mathbf{z}_1 = [\mathbf{z}_{11}, \dots, \mathbf{z}_{1m}]^T$, $\mathbf{z}_2 =
[\mathbf{z}_{21}, \dots, \mathbf{z}_{2m}]^T$, and $b_2 = [b_{21}, \dots, b_{2m}]^T$.

We examine two approaches to inference, one based on REML (Sect. 8.5.3) and
the other Bayesian, using INLA for computation. In each case, to fit the model,
we first construct the basis functions and from these, the required design matrices.
Running the REML version of the model, we obtain $\hat\sigma_\epsilon = 0.033$, which we use
to evaluate the effective degrees of freedom associated with the priors for each of
$\sigma_1^2$ and $\sigma_2^2$. We assume the usual improper prior, $\pi(\sigma_\epsilon^2) \propto 1/\sigma_\epsilon^2$ for $\sigma_\epsilon^2$. After some
experimentation, we settled on the prior $\sigma_1^{-2} \sim \mathrm{Ga}(0.5, 5 \times 10^{-6})$. For $\sigma_2^2$, we desire
a 90% interval for $b_{2i}$ of $\pm 0.3$ which, with 1 degree of freedom for the marginal
distribution, leads to $\sigma_2^{-2} \sim \mathrm{Ga}(0.5, 0.00113)$. See Sect. 8.6.2 for details on the
rationale for this approach. Figures 11.12(a) and (d) show the priors for $\sigma_1$ and $\sigma_2$,
with the priors on the implied effective degrees of freedom displayed in panels (b)
and (e). For the spline component, the 90% prior interval on the effective degrees
of freedom is [2.4, 10]. Figures 11.12(c) and (f) show the relationship between the
standard deviations and the effective degrees of freedom.
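The gamma parameters quoted for the random intercepts can be reproduced from the stated interval. The sketch below relies on the standard result that a normal random effect with a $\mathrm{Ga}(a, b)$ prior on its precision has a Student $t$ marginal with $2a$ degrees of freedom and squared scale $b/a$; it is assumed here that this is the construction behind the numbers in the text. Only the 1 degree of freedom case is coded, since the quantiles of a $t_1$ (Cauchy) distribution have a closed form, and the function name is hypothetical.

```python
from math import pi, tan

def gamma_prior_from_range(R, d=1, level=0.90):
    """Return (a, b) with precision ~ Ga(a, b) so that the marginal t_d
    distribution of the random effect has a central `level` interval +/- R.
    Only d = 1 (Cauchy quantiles) is implemented in this sketch."""
    if d != 1:
        raise NotImplementedError("t quantiles for d != 1 are not coded here")
    q = tan(pi * ((1 + level) / 2 - 0.5))   # upper quantile of a standard Cauchy
    a = d / 2
    b = a * (R / q) ** 2                    # marginal squared scale is b / a
    return a, b

a, b = gamma_prior_from_range(0.3)
print(a, round(b, 5))
```

With a 90% interval of $\pm 0.3$ this gives a shape of 0.5 and a rate that rounds to the 0.00113 quoted above.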

Table 11.1 compares estimates from REML and INLA implementations of the
model, and we see close correspondence between the two. Figures 11.12(a) and (d)
show the posterior medians for $\sigma_1$ and $\sigma_2$, which correspond to effective degrees of
freedom of 8 and 214 for the spline model and random intercepts, respectively, as
displayed on panels (b) and (e). The effective degrees of freedom of 214 associated
with the random intercepts show that there is considerable variability between the
230 women here. This is confirmed in Fig. 11.11, where we observe large vertical
differences between the profiles. This figure also shows the fitted spline, which
appears to mimic the age trend in the data well.


Fig. 11.12 Prior summaries for the spinal bone mineral density data, (a) $\sigma_1$, the standard deviation
of the spline coefficients; (b) effective degrees of freedom associated with the prior for the spline
coefficients; (c) effective degrees of freedom versus $\sigma_1$; (d) $\sigma_2$, the standard deviation of the
between-individual random effects; (e) effective degrees of freedom associated with the individual
random effects; and (f) effective degrees of freedom versus $\sigma_2$. The lower and upper dashed
horizontal lines in panels (c) and (f) are the minimum and maximum attainable degrees of freedom,
respectively. The vertical dashed lines on panels (a), (b), (d), and (e) correspond to the posterior
medians

Table 11.1 REML and INLA summaries for the spinal bone data. The intercept corresponds to the
Asian group. For the entries marked with a *, standard errors were unavailable

Variable           REML             INLA
Intercept          0.560 ± 0.029    0.563 ± 0.031
Black              0.106 ± 0.021    0.106 ± 0.021
Hispanic           0.013 ± 0.022    0.013 ± 0.022
White              0.026 ± 0.022    0.026 ± 0.022
Age                0.021 ± 0.002    0.021 ± 0.002
$\sigma_1$         0.018*           0.024 ± 0.006
$\sigma_2$         0.109*           0.109 ± 0.006
$\sigma_\epsilon$  0.033*           0.033 ± 0.002


11.3  Kernel  Methods

We now turn to another class of smoothers that are based on kernels. Kernel methods
are used in both density estimation and nonparametric regression, and it is the latter
on which we concentrate (though we touch on the former in Sect. 11.3.2). The basic
idea underlying kernel methods is to estimate the density/regression function locally
with the kernel function weighting the data in an appropriate fashion. We begin by
briefly defining, and giving examples of, kernels.


11.3.1  Kernels 


A kernel is a smooth function $K(\cdot)$ such that $K(x) \ge 0$, with

$$\int K(u) \, du = 1, \quad \int u K(u) \, du = 0, \quad \sigma_K^2 = \int u^2 K(u) \, du < \infty. \qquad (11.26)$$


In  practice,  a  kernel  is  applied  to  a  standardized  variable,  and  so,  in  what  follows, 
we  do  not  include  a  scale  parameter  since  the  standardization  has  removed  the 
dependence  on  scale. 

We describe four common examples of kernel functions. The Gaussian kernel is

$$K(x) = (2\pi)^{-1/2} \exp(-x^2/2)$$

and is nonzero for all $x$, which makes this kernel relatively computationally
expensive to work with since all points must be considered in calculations for a
single $x$. We describe three alternatives but first define

$$I(x) = \begin{cases} 1 & \text{if } |x| \le 1 \\ 0 & \text{if } |x| > 1. \end{cases}$$

The Epanechnikov kernel has the form

$$K(x) = \frac{3}{4}(1 - x^2) I(x), \qquad (11.27)$$

while the tricube kernel is

$$K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x). \qquad (11.28)$$

Finally, the boxcar kernel is

$$K(x) = \frac{1}{2} I(x). \qquad (11.29)$$

Fig. 11.13 Pictorial representation of four commonly used kernels

All  four  kernels  are  displayed  in  Fig.  11.13.  We  first  describe  kernel  density 
estimation,  which  is  a  simple  technique  used  in  a  classification  context  (as  described 
in  Sect.  12.8.3). 
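The four kernels can be written down directly and checked numerically against the conditions in (11.26). The following is a minimal sketch using NumPy; the function names and the Riemann-sum helper `kernel_moments` are illustrative choices, not part of the text.

```python
import numpy as np

def indicator(x):
    """I(x): 1 where |x| <= 1, 0 elsewhere."""
    return (np.abs(x) <= 1).astype(float)

def gaussian(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def epanechnikov(x):
    return 0.75 * (1.0 - x**2) * indicator(x)

def tricube(x):
    return (70.0 / 81.0) * (1.0 - np.abs(x)**3)**3 * indicator(x)

def boxcar(x):
    return 0.5 * indicator(x)

KERNELS = {"gaussian": gaussian, "epanechnikov": epanechnikov,
           "tricube": tricube, "boxcar": boxcar}

def kernel_moments(kernel, half_width=10.0, m=200001):
    """Riemann-sum approximations to the three integrals in (11.26)."""
    u = np.linspace(-half_width, half_width, m)
    du = u[1] - u[0]
    k = kernel(u)
    return k.sum() * du, (u * k).sum() * du, (u**2 * k).sum() * du

for name, kernel in KERNELS.items():
    total, mean, var = kernel_moments(kernel)
    print(f"{name:13s} integral={total:.4f} mean={mean:.4f} sigma_K^2={var:.4f}")
```

Each kernel should integrate to approximately one with mean approximately zero; the kernel variance $\sigma_K^2$ differs across kernels (for example, 1 for the Gaussian and 1/5 for the Epanechnikov).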


11.3.2  Kernel  Density  Estimation 


Consider a random univariate sample $x_1, \ldots, x_n$ from a density $p(\cdot)$. The kernel density estimate (KDE) of the unknown density, given a smoothing parameter $\lambda$, is

$$\hat{p}^{(\lambda)}(x) = \frac{1}{n\lambda}\sum_{i=1}^n K\left(\frac{x - x_i}{\lambda}\right), \tag{11.30}$$

so that the estimate of the density at $x$ is potentially built upon contributions from all $n$ observed values, though for the finite range kernels (11.27) to (11.29), the sum will typically be over far fewer points. Choosing $K(\cdot)$ as a probability density function ensures that $\hat{p}^{(\lambda)}(x)$ is also a density. We write $K_\lambda(u) = \lambda^{-1}K(u/\lambda)$ for a slightly more compact notation.
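As an illustration of (11.30), the following sketch evaluates the kernel density estimate on a grid. The Gaussian kernel, the simulated sample, and the bandwidth value 0.4 are assumptions made purely for the example.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, lam, kernel=gaussian_kernel):
    """Kernel density estimate (11.30) evaluated at each point of x_grid."""
    x_grid = np.asarray(x_grid, dtype=float)
    data = np.asarray(data, dtype=float)
    # u[j, i] = (x_j - x_i) / lambda for grid point j and observation i
    u = (x_grid[:, None] - data[None, :]) / lam
    return kernel(u).sum(axis=1) / (len(data) * lam)

rng = np.random.default_rng(0)
data = rng.normal(size=200)          # simulated sample, for illustration only
grid = np.linspace(-6.0, 6.0, 1201)
dens = kde(grid, data, lam=0.4)
area = dens.sum() * (grid[1] - grid[0])  # Riemann sum; should be close to 1
print(round(area, 4))
```

Because the kernel is itself a density, the estimate is nonnegative and integrates to one, which the Riemann sum confirms up to discretization error.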

We  now  informally  state  a  number  of  properties  of  the  kernel  density  estimator. 
A  number  of  regularity  conditions  are  required,  the  most  important  of  which  is 
that the second derivative $p''(x)$ is absolutely continuous; Wand and Jones (1995, Chap. 2) contains more details. We also assume the conditions on $K(\cdot)$ given in (11.26).

Since $x_1, \ldots, x_n$ are a random sample from $p(\cdot)$, the expectation of the density estimator can be written as

$$E\left[\hat{p}^{(\lambda)}(x)\right] = E_T\left[K_\lambda(x - T)\right] = \int K_\lambda(x - t)p(t)\,dt, \tag{11.31}$$


which  is  a  convolution  of  the  true  density  with  the  kernel.  Smoothing  has,  therefore, 
produced  a  biased  estimator  whose  mean  is  a  smoothed  version  of  the  true  density. 
Clearly, we wish to have $\lambda \to 0$ as $n \to \infty$, so that the kernel concentrates more and more on $x$ with increasing $n$, ensuring that the bias goes to zero.

We write $\lambda_n$ to emphasize the dependence on $n$. It is straightforward to show that, as $n \to \infty$, with $\lambda_n \to 0$ and $n\lambda_n \to \infty$:

$$E\left[\hat{p}^{(\lambda_n)}(x)\right] = p(x) + \frac{1}{2}\lambda_n^2 p''(x)\sigma_K^2 + o(\lambda_n^2),$$

so that the estimator is asymptotically unbiased.
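Proof. With $\hat{p}^{(\lambda_n)}(x)$ given by (11.30),

$$\begin{aligned}
E\left[\hat{p}^{(\lambda_n)}(x)\right] &= \int K_{\lambda_n}(x - t)p(t)\,dt \\
&= \int K(u)p(x - \lambda_n u)\,du \\
&= \int K(u)\left[p(x) - \lambda_n u p'(x) + \frac{\lambda_n^2u^2}{2}p''(x) + \ldots\right]du \\
&= p(x) + \frac{\lambda_n^2}{2}p''(x)\sigma_K^2 + o(\lambda_n^2). \qquad \square
\end{aligned}$$

The bias is large whenever the absolute value of the second derivative is large. In peaks, $p''(x) < 0$, and the bias is negative since $\hat{p}^{(\lambda_n)}(x)$ underestimates $p(x)$, and in troughs, the bias is positive as $\hat{p}^{(\lambda_n)}(x)$ overestimates $p(x)$.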

Via a similar calculation,

$$\text{var}\left(\hat{p}^{(\lambda_n)}(x)\right) = \frac{1}{n\lambda_n}p(x)\kappa_2 + o\left(\frac{1}{n\lambda_n}\right),$$

where $\kappa_2 = \int K(u)^2\,du$ and $n\lambda_n$ is a "local sample size" (so that larger $\lambda_n$ gives a larger effective sample size). The variance is also proportional to the height of the density. Overall, as $\lambda_n$ decreases to zero, the bias diminishes, while the variance increases, with the opposite behavior occurring as $\lambda_n$ increases. The combined effect is that, in order to obtain an estimator which converges to the true density, we require both $\lambda_n$ and $1/(n\lambda_n)$ to decrease as sample size increases.

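As discussed in Sect. 10.4, the accuracy of an estimator may be assessed by evaluating the mean squared error (MSE). For $\hat{p}^{(\lambda_n)}(x)$,

$$\begin{aligned}
\text{MSE}\left[\hat{p}^{(\lambda_n)}(x)\right] &= E\left[\left(\hat{p}^{(\lambda_n)}(x) - p(x)\right)^2\right] \\
&= \text{bias}\left(\hat{p}^{(\lambda_n)}(x)\right)^2 + \text{var}\left(\hat{p}^{(\lambda_n)}(x)\right) \\
&\approx \frac{1}{4}\lambda_n^4 p''(x)^2\sigma_K^4 + \frac{1}{n\lambda_n}p(x)\kappa_2,
\end{aligned} \tag{11.32}$$

where the expectation in (11.32) is over the uncertainty in $\hat{p}^{(\lambda_n)}(x)$, that is, over the sampling distribution of $X_1, \ldots, X_n$.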

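Averaging the MSE over $x$ gives the integrated mean squared error

$$\text{IMSE}\left[\hat{p}^{(\lambda_n)}\right] = \int \text{MSE}\left[\hat{p}^{(\lambda_n)}(x)\right]dx \approx \frac{1}{4}\lambda_n^4\sigma_K^4\int p''(x)^2\,dx + \frac{\kappa_2}{n\lambda_n}. \tag{11.33}$$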


If we differentiate (11.33) with respect to $\lambda_n$ and set equal to zero, we obtain an asymptotic optimal bandwidth of

$$\lambda_n^\star = \left(\frac{\kappa_2}{n\,\sigma_K^4\int p''(x)^2\,dx}\right)^{1/5}. \tag{11.34}$$
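The differentiation step can be written out explicitly. The worked derivation below is a sketch that uses the shorthand $R = \sigma_K^4\int p''(x)^2\,dx$, which is notation introduced here only for compactness.

```latex
\begin{aligned}
\frac{d}{d\lambda_n}\left[\frac{1}{4}\lambda_n^4 R + \frac{\kappa_2}{n\lambda_n}\right]
  &= \lambda_n^3 R - \frac{\kappa_2}{n\lambda_n^2} = 0
  \quad\Longrightarrow\quad
  \lambda_n^5 = \frac{\kappa_2}{nR}, \\
\text{IMSE at the optimum}
  &\approx \frac{1}{4}\lambda_n^4 R + \lambda_n^4 R
   = \frac{5}{4}\,R\left(\frac{\kappa_2}{nR}\right)^{4/5}
   = \frac{5}{4}\,\kappa_2^{4/5}R^{1/5}\,n^{-4/5}.
\end{aligned}
```

The second line uses $\kappa_2/(n\lambda_n) = \lambda_n^4\,\kappa_2/(n\lambda_n^5) = \lambda_n^4 R$ at the optimum, so the variance contribution is four times the squared bias contribution.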


This formula is useful since it informs us that the optimal bandwidth decreases at rate $n^{-1/5}$. Then, substitution in (11.33) shows that the IMSE is of $O(n^{-4/5})$. It can be shown that there does not exist any estimator that converges faster than this rate, assuming only the existence of second derivatives, $p''$; for more details, see Chap. 24 of van der Vaart (1998).6

We turn now to a discussion of estimation of the amount of smoothing to carry out, that is, how to estimate the optimal $\lambda_n$. So-called "plug-in" estimators substitute estimates for unknown quantities (here the integrated squared second derivative in the denominator) in order to evaluate $\lambda_n^\star$. If we assume that $p(\cdot)$ is normal in (11.34), and use a Gaussian kernel, we obtain

$$\lambda_n^\star = (4/3)^{1/5} \times \sigma n^{-1/5}, \tag{11.35}$$

where $\sigma$ is the standard deviation of the normal.
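The normal reference rule (11.35) can be checked numerically against the general formula (11.34). The sketch below assumes a Gaussian kernel and a $N(0, \sigma^2)$ density and approximates the required integrals by Riemann sums; the function names are illustrative.

```python
import numpy as np

def normal_reference_bandwidth(sigma, n):
    """Closed form (11.35)."""
    return (4.0 / 3.0) ** 0.2 * sigma * n ** (-0.2)

def optimal_bandwidth_numeric(sigma, n, m=400001):
    """Evaluate (11.34) numerically for a Gaussian kernel and a N(0, sigma^2) density."""
    u = np.linspace(-12.0, 12.0, m)
    du = u[1] - u[0]
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    kappa2 = (K**2).sum() * du            # integral of K(u)^2
    sigma_K2 = (u**2 * K).sum() * du      # kernel variance sigma_K^2
    x = sigma * u
    dx = sigma * du
    p = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    p_dd = p * (x**2 / sigma**4 - 1.0 / sigma**2)   # second derivative of the normal density
    R = (p_dd**2).sum() * dx              # integral of p''(x)^2
    return (kappa2 / (n * sigma_K2**2 * R)) ** 0.2

closed = normal_reference_bandwidth(sigma=2.0, n=500)
numeric = optimal_bandwidth_numeric(sigma=2.0, n=500)
print(closed, numeric)
```

In practice $\sigma$ is unknown and would be replaced by an estimate such as the sample standard deviation, which is a further assumption beyond what (11.35) states.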


6 The histogram estimator converges at rate $O(n^{-2/3})$; see, for example, Wand and Jones (1995, Sect. 2.5).
Leave-one-out cross-validation may be used to choose $\lambda_n$ in order to minimize a measure of estimation accuracy. One convenient quantity that may be minimized is the integrated squared error (ISE), defined as

$$\begin{aligned}
\text{ISE}\left[\hat{p}^{(\lambda_n)}\right] &= \int\left[\hat{p}^{(\lambda_n)}(x) - p(x)\right]^2dx \\
&= \int \hat{p}^{(\lambda_n)}(x)^2\,dx - 2\int p(x)\hat{p}^{(\lambda_n)}(x)\,dx + \int p(x)^2\,dx.
\end{aligned}$$

The last term does not involve $\lambda_n$, and the other terms can be approximated by

$$\frac{1}{n}\sum_{i=1}^n\int \hat{p}_{-i}^{(\lambda_n)}(x)^2\,dx - \frac{2}{n}\sum_{i=1}^n \hat{p}_{-i}^{(\lambda_n)}(x_i),$$

where $\hat{p}_{-i}^{(\lambda_n)}(x)$ is the estimator constructed from the data without observation $x_i$. The use of normal kernels gives a very convenient form for estimation, as described by Bowman and Azzalini (1997, p. 37).


11.3.3  The  Nadaraya-Watson  Kernel  Estimator 


We now turn to nonparametric regression and estimation of

$$\begin{aligned}
f(x) &= E[Y \mid x] \\
&= \int y\,p(y \mid x)\,dy \\
&= \frac{1}{p(x)}\int y\,p(x, y)\,dy.
\end{aligned}$$

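Suppose we estimate $p(x, y)$ by the product kernel

$$\hat{p}^{(\lambda_x,\lambda_y)}(x, y) = \frac{1}{n\lambda_x\lambda_y}\sum_{i=1}^n K_x\left(\frac{x - x_i}{\lambda_x}\right)K_y\left(\frac{y - y_i}{\lambda_y}\right)$$

and $p(x)$ by

$$\hat{p}^{(\lambda_x)}(x) = \frac{1}{n\lambda_x}\sum_{i=1}^n K_x\left(\frac{x - x_i}{\lambda_x}\right). \tag{11.36}$$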

Substitution  of  these  estimates  in  (11.36)  gives  the  Nadaraya-Watson  kernel 
regression  estimator  (Nadaraya  1964;  Watson  1964): 
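$$\begin{aligned}
\hat{f}(x) &= \int y\,\frac{\frac{1}{n\lambda_x\lambda_y}\sum_{i=1}^n K_x\left(\frac{x - x_i}{\lambda_x}\right)K_y\left(\frac{y - y_i}{\lambda_y}\right)}{\frac{1}{n\lambda_x}\sum_{i=1}^n K_x\left(\frac{x - x_i}{\lambda_x}\right)}\,dy \\
&= \frac{\sum_{i=1}^n K_x\left(\frac{x - x_i}{\lambda_x}\right)\int (y_i + u\lambda_y)K_y(u)\,du}{\sum_{i=1}^n K_x\left(\frac{x - x_i}{\lambda_x}\right)} \\
&= \frac{\sum_{i=1}^n K\left(\frac{x - x_i}{\lambda}\right)y_i}{\sum_{i=1}^n K\left(\frac{x - x_i}{\lambda}\right)},
\end{aligned} \tag{11.37}$$

where we have used $\int K_y(u)\,du = 1$ and $\int uK_y(u)\,du = 0$. We also write $\lambda = \lambda_x$ and $K_x = K$ in the final line. This estimator may be written as the linear smoother:

$$\hat{f}^{(\lambda)}(x) = \sum_{i=1}^n S_i^{(\lambda)}(x)Y_i,$$

where the weights $S_i^{(\lambda)}(x)$ are defined as

$$S_i^{(\lambda)}(x) = \frac{K\left(\frac{x - x_i}{\lambda}\right)}{\sum_{j=1}^n K\left(\frac{x - x_j}{\lambda}\right)}.$$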

As  a  special  case,  a  rectangular  window  (i.e.,  the  boxcar  kernel)  produces  a  smoother 
that  is  a  simple  moving  average.  As  with  spline  models,  the  choice  of  the  smoothing 
parameter  A  is  crucial  for  reasonable  behavior  of  the  estimator. 
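A minimal sketch of the Nadaraya-Watson estimator (11.37) follows, returning both the fit and the smoother weights. The function names and the toy data are illustrative assumptions; with the boxcar kernel the fit reduces to a moving average of the responses whose $x_i$ lie within $\lambda$ of the target point.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def boxcar_kernel(u):
    return 0.5 * (np.abs(u) <= 1)

def nadaraya_watson(x0, x, y, lam, kernel=gaussian_kernel):
    """Return (fit, S): the fitted values at x0 and the weight rows S_i(x0)."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = kernel((x0[:, None] - x[None, :]) / lam)   # shape (len(x0), n)
    # Note: a finite range kernel with a very small lam can give all-zero
    # weights at some x0, in which case the ratio below is undefined.
    S = w / w.sum(axis=1, keepdims=True)
    return S @ y, S

x = np.arange(10.0)
y = x**2
fit, S = nadaraya_watson(4.0, x, y, lam=1.5, kernel=boxcar_kernel)
print(fit[0])   # average of y at x = 3, 4, 5
```

The rows of `S` sum to one, so each fitted value is a weighted average of the observed responses.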

We  now  examine  the  asymptotic  IMSE  which,  as  usual,  can  be  decomposed  into 
contributions  due  to  squared  bias  and  variance.  An  advantage  of  local  polynomial 
regression  estimators  is  that  the  form  of  the  bias  and  variance  is  relatively  simple, 
thus  enabling  analytic  study.  For  the  subsequent  calculations,  and  those  that  appear 
later  in  this  chapter,  we  state  results  without  regularity  conditions.  See  Fan  (1992, 
1993)  for  a  more  rigorous  treatment. 

As $\lambda_n \to 0$ and $n\lambda_n \to \infty$, the bias of the Nadaraya-Watson estimator at the point $x$ is

$$\text{bias}\left(\hat{f}^{(\lambda_n)}(x)\right) \approx \frac{\lambda_n^2\sigma_K^2}{2}\left[f''(x) + 2f'(x)\frac{p'(x)}{p(x)}\right], \tag{11.38}$$

where $p(x)$ is the true but unknown density of $x$. The bias increases with increasing $\lambda_n$ as we would expect. The bias also increases at points at which $f(\cdot)$ increases in "wiggliness" (i.e., large $f''(x)$) and where the derivative of the "design density," $p'(x)$, is large. The so-called design bias is defined as $2f'(x)p'(x)/p(x)$ and, as we will see in Sect. 11.3.4, may be removed if locally linear polynomial models are used.
The variance at the point $x$ is

$$\text{var}\left(\hat{f}^{(\lambda_n)}(x)\right) \approx \frac{\kappa_2\sigma^2}{n\lambda_n}\frac{1}{p(x)}, \tag{11.39}$$

where we have assumed, for simplicity, that the variance $\sigma^2 = \text{var}(Y \mid x)$ is constant. The variance of the estimator decreases with decreasing measurement error, increasing density of $x$ values, and increasing local sample size $n\lambda_n$. Consequently, we see the "usual" trade-off with small $\lambda$ reducing the bias but increasing the variance. Combining the squared bias and variance and integrating over $x$ gives the IMSE:


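$$\text{IMSE}\left[\hat{f}^{(\lambda_n)}\right] \approx \frac{\lambda_n^4\sigma_K^4}{4}\int\left[f''(x) + 2f'(x)\frac{p'(x)}{p(x)}\right]^2dx + \frac{\kappa_2\sigma^2}{n\lambda_n}\int\frac{1}{p(x)}\,dx. \tag{11.40}$$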


If we differentiate this expression and set equal to zero, we obtain the optimal bandwidth as

$$\lambda_n^\star = \left(\frac{1}{n}\right)^{1/5}\left(\frac{\sigma^2\kappa_2\int p(x)^{-1}\,dx}{\sigma_K^4\int\left(f''(x) + 2f'(x)p'(x)/p(x)\right)^2dx}\right)^{1/5}, \tag{11.41}$$

so that $\lambda_n^\star = O(n^{-1/5})$. Plugging this expression into (11.40) shows that the IMSE is $O(n^{-4/5})$, which holds for many nonparametric estimators and is in contrast to most parametric estimators whose variance is $O(n^{-1})$. The loss in efficiency is the cost of the flexibility offered by nonparametric methods. Expression (11.41) depends on many unknown quantities, and while there are "plug-in" methods for estimating these terms, a popular approach is cross-validation.


11.3.4  Local  Polynomial  Regression 


We now describe a generalization of the Nadaraya-Watson kernel estimator, local polynomial regression, with improved theoretical properties. Let $w_i(x) = K[(x_i - x)/\lambda]$ be a weight function and choose $\hat{\beta}_{0x} = \hat{f}(x)$ to minimize the weighted sum of squares

$$\sum_{i=1}^n w_i(x)(Y_i - \beta_{0x})^2$$

with solution

$$\hat{f}(x) = \hat{\beta}_{0x} = \frac{\sum_{i=1}^n w_i(x)Y_i}{\sum_{i=1}^n w_i(x)},$$

showing that the Nadaraya-Watson kernel regression estimator (11.37) corresponds to a locally constant model, estimated using weighted least squares. For notational simplicity, we have not acknowledged that the weight $w_i(x)$ depends on the smoothing parameter $\lambda$. We emphasize that we carry out a separate weighted least squares fit for each prediction that we wish to obtain.

This formulation suggests an extension in which a local polynomial replaces the locally constant model of the Nadaraya-Watson kernel estimator. For values of $u$ in a neighborhood of a fixed $x$, define the polynomial:

$$P_x(u; \beta_x) = \beta_{0x} + \beta_{1x}(u - x) + \frac{\beta_{2x}}{2!}(u - x)^2 + \ldots + \frac{\beta_{px}}{p!}(u - x)^p,$$

with $\beta_x = [\beta_{0x}, \ldots, \beta_{px}]^T$. The idea is to approximate $f$ in a neighborhood of $x$ by the polynomial $P_x(u; \beta_x)$.7 The parameter $\beta_x$ is chosen to minimize the locally weighted sum of squares:

$$\sum_{i=1}^n w_i(x)\left[Y_i - P_x(x_i; \beta_x)\right]^2. \tag{11.42}$$

The ensuing local estimate of $f$ at $u$ is

$$\hat{f}(u) = P_x(u; \hat{\beta}_x).$$

We could use this estimate in a local neighborhood of $x$, but instead, we fit a new local polynomial for every target $x$ value. At a target value $u = x$,

$$\hat{f}(x) = P_x(x; \hat{\beta}_x) = \hat{\beta}_{0x}.$$

The weight function is $w_i(x) = K[(x_i - x)/\lambda]$, so that the level of smoothing is controlled by the smoothing parameter $\lambda$, with $\lambda = 0$ resulting in $\hat{f}(x_i) = y_i$ and $\lambda = \infty$ being equivalent to the fitting of a linear model. It is important to emphasize that $\hat{f}(x)$ only depends on the intercept $\hat{\beta}_{0x}$ of a local polynomial model, but should not be confused with the fitting of a locally constant model.

For estimating the function $f$ at the point $x$, local regression is equivalent to applying weighted least squares to the model:

$$Y = x_x\beta_x + \epsilon_x, \tag{11.43}$$

with $E[\epsilon_x] = 0$, $\text{var}(\epsilon_x) = \sigma^2W_x^{-1}$,

$$x_x = \begin{bmatrix}
1 & x_1 - x & \cdots & \dfrac{(x_1 - x)^p}{p!} \\
1 & x_2 - x & \cdots & \dfrac{(x_2 - x)^p}{p!} \\
\vdots & \vdots & & \vdots \\
1 & x_n - x & \cdots & \dfrac{(x_n - x)^p}{p!}
\end{bmatrix}$$

representing the $n \times (p + 1)$ design matrix and $W_x$ the $n \times n$ diagonal matrix with elements $w_i(x)$, $i = 1, \ldots, n$. Large values of $w_i$ correspond to $x - x_i$ being small, so that data points $x_i$ close to $x$ are most influential. With the finite range kernels described in Sect. 11.3.1, some of the $w_i(x)$ elements will be zero, in which case we would only consider the data with nonzero elements within (11.43). Note that $W_x$ depends on the kernel function, $K(\cdot)$, and therefore upon the bandwidth, $\lambda$. Minimization of

$$(Y - x_x\beta_x)^TW_x(Y - x_x\beta_x)$$

gives

$$\hat{\beta}_x = (x_x^TW_xx_x)^{-1}x_x^TW_xY. \tag{11.44}$$

7 This approximation may be formally motivated via a Taylor series approximation argument.

Taking the inner product of the first row of $(x_x^TW_xx_x)^{-1}x_x^TW_x$ with $Y$ gives $\hat{f}(x) = \hat{\beta}_{0x}$.

From (11.44), it is clear that this estimator is linear in the data:

$$\hat{f}(x) = \sum_{i=1}^n S_i^{(\lambda)}(x)Y_i.$$

This estimator has mean

$$E[\hat{f}(x)] = \sum_{i=1}^n S_i^{(\lambda)}(x)f(x_i)$$

and variance

$$\text{var}\left(\hat{f}(x)\right) = \sigma^2\sum_{i=1}^n S_i^{(\lambda)}(x)^2,$$

where we have again assumed the error variance is constant and that the observations are uncorrelated. The effective degrees of freedom can be defined as $p^{(\lambda)} = \text{tr}(S^{(\lambda)})$, where $S^{(\lambda)}$ is the "hat" matrix determined from $\hat{Y} = S^{(\lambda)}Y$.
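The weighted least squares formulation translates directly into code. The sketch below builds the row of smoother weights at a target point from the first row of the weighted least squares solution in (11.44), stacks these rows into the hat matrix, and computes the effective degrees of freedom as its trace. The tricube kernel, the bandwidth, and the toy data are assumptions for illustration.

```python
import math
import numpy as np

def tricube(u):
    a = np.abs(u)
    return np.where(a <= 1, (70.0 / 81.0) * (1.0 - a**3) ** 3, 0.0)

def local_poly_weights(x0, x, lam, degree=1, kernel=tricube):
    """Row of weights S(x0) such that f_hat(x0) = S(x0) @ y."""
    w = kernel((x - x0) / lam)
    # Design matrix with columns (x_i - x0)^j / j!, j = 0, ..., degree
    X = np.column_stack([(x - x0) ** j / math.factorial(j)
                         for j in range(degree + 1)])
    XtW = X.T * w                      # X' W with W = diag(w)
    A = XtW @ X                        # X' W X
    # First row of (X' W X)^{-1} X' W gives the intercept, i.e. the fit at x0
    return np.linalg.solve(A, XtW)[0]

def local_poly_smoother(x, lam, degree=1, kernel=tricube):
    """Hat matrix whose rows are the weights at each observed x_i."""
    return np.vstack([local_poly_weights(x0, x, lam, degree, kernel)
                      for x0 in x])

x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x                      # exactly linear response, no noise
S = local_poly_smoother(x, lam=0.2)
fitted = S @ y
eff_df = np.trace(S)
print(eff_df)
```

With a linear response the local linear fit reproduces the data exactly, which matches the statement that the estimator is unbiased when $f$ is linear. The bandwidth must be large enough that each window contains at least `degree + 1` distinct design points, otherwise the local system is singular.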

Asymptotic analysis suggests that local polynomials of odd degree dominate those of even degree (Fan and Gijbels 1996), though Wand and Jones (1995) emphasize that the practical implications of this result should not be overinterpreted. Often $p = 1$ will be sufficient for estimating $f(\cdot)$. It can also be shown (Exercise 11.6) that with a linear local polynomial, we obtain

$$\hat{f}(x) = \frac{\sum_{i=1}^n w_i(x)Y_i}{\sum_{i=1}^n w_i(x)} + (x - \bar{x}_w)\frac{\sum_{i=1}^n w_i(x)(x_i - \bar{x}_w)Y_i}{\sum_{i=1}^n w_i(x)(x_i - \bar{x}_w)^2},$$

where $\bar{x}_w = \sum_{i=1}^n w_i(x)x_i/\sum_{i=1}^n w_i(x)$ and $w_i(x) = K((x - x_i)/\lambda)$. Therefore, the estimator is the locally constant (Nadaraya-Watson) estimator plus a term that corrects for the local slope and skewness of the $x_i$.

For the linear local polynomial model, we have

$$E\left[\hat{f}^{(\lambda_n)}(x)\right] \approx f(x) + \frac{1}{2}\lambda_n^2f''(x)\sigma_K^2, \qquad \text{var}\left(\hat{f}^{(\lambda_n)}(x)\right) \approx \frac{\kappa_2\sigma^2}{n\lambda_n\,p(x)}.$$

Proofs of these expressions may be found in Wand and Jones (1995, Sect. 5.3). Notice that the bias is dominated by the second derivative, which reflects the error in the linear approximation. If $f$ is linear in $x$, then $\hat{f}$ is exactly unbiased.

Fig. 11.14 Local linear polynomial fits to the LIDAR data, with three different kernels. The fits are indistinguishable

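For the local linear polynomial estimator,

$$\begin{aligned}
\text{IMSE}\left[\hat{f}^{(\lambda_n)}\right] &= \int\text{bias}\left(\hat{f}^{(\lambda_n)}(x)\right)^2dx + \int\text{var}\left(\hat{f}^{(\lambda_n)}(x)\right)dx \\
&\approx \frac{\lambda_n^4\sigma_K^4}{4}\int f''(x)^2\,dx + \frac{\kappa_2\sigma^2}{n\lambda_n}\int\frac{1}{p(x)}\,dx.
\end{aligned}$$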


In comparison with (11.40), the design bias is zero, showing a clear advantage of the linear polynomial over the Nadaraya-Watson estimator. The optimal $\lambda$ is therefore

$$\lambda_n^\star = \left(\frac{1}{n}\right)^{1/5}\left(\frac{\sigma^2\kappa_2\int p(x)^{-1}\,dx}{\sigma_K^4\int f''(x)^2\,dx}\right)^{1/5}. \tag{11.45}$$

Each of the terms in expression (11.45) can be estimated to give a "plug-in" estimator of $\lambda_n$, or cross-validation may be used. Since the local polynomial regression estimator is a linear smoother, inference for this model follows as in Sect. 11.2.7.
Example: Light Detection and Ranging

Figure  11.14  shows  scatterplot  smoothing  of  the  LIDAR  data  using  local  linear 
polynomials  and  Gaussian,  tricube  and  Epanechnikov  kernels.  In  each  case  the 
smoothing  parameter  is  chosen  via  generalized  cross-validation,  as  described  in 
Sect.  10.6.3.  The  choice  of  kernel  is  clearly  unimportant  in  this  example. 


11.4  Variance  Estimation 

Accurate inference, for example, confidence intervals for $f(x)$ at a particular $x$, depends on accurate estimation of the error variance, which may be nonconstant.

We begin by assuming that the model is

$$Y_i = E[Y_i \mid x_i] + \epsilon_i = f(x_i) + \epsilon_i,$$

with $\text{var}(\epsilon_i \mid x_i) = \sigma^2$ and $\text{cov}(\epsilon_i, \epsilon_j \mid x_i, x_j) = 0$. We have made the crucial, and strong, assumption that the errors have constant variance (i.e., are homoscedastic) and are uncorrelated. We assume a linear smoother so that $\hat{f} = SY$, with $p = \text{tr}(S)$ the effective degrees of freedom, and we suppress the dependence on the smoothing parameter.

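The expectation of the residual sum of squares is

$$\begin{aligned}
E[(Y - \hat{f})^T(Y - \hat{f})] &= E[(Y - SY)^T(Y - SY)] \\
&= E[Y^T(I - S)^T(I - S)Y] \\
&= f^T(I - S)^T(I - S)f + \text{tr}\left[(I - S)^T(I - S)I\sigma^2\right] \\
&\qquad \text{using identity (B.4) from Appendix B} \\
&= f^T(I - S)^T(I - S)f + \sigma^2\,\text{tr}(I - S^T - S + S^TS) \\
&= f^T(I - S)^T(I - S)f + \sigma^2(n - 2p + \tilde{p}),
\end{aligned}$$

where

$$\tilde{p} = \text{tr}(S^TS).$$

The bias is

$$f - E[\hat{f}] = f - SE[Y] = f - Sf = (I - S)f.$$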

Therefore,

$$E\left[\frac{\text{RSS}}{n - 2p + \tilde{p}}\right] = \sigma^2 + \frac{f^T(I - S)^T(I - S)f}{n - 2p + \tilde{p}},$$

with the second term being the sum of squared bias terms divided by a particular form of degrees of freedom. If the second term is small, it may be ignored to give the estimator:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n\left(Y_i - \hat{f}(x_i)\right)^2}{n - 2p + \tilde{p}}. \tag{11.46}$$

Notice that for symmetric, idempotent $S$, we have $S^TS = S$, $\tilde{p} = p$, and (11.46) results in an estimator with a more familiar form, that is, with denominator $n - p$ with $p$ the effective degrees of freedom.

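We now derive an alternative local differencing (method of moments) estimator (Rice 1984). We begin by considering the expected differences:

$$\begin{aligned}
E\left[(Y_{i+1} - Y_i)^2\right] &= E\left[(f_{i+1} + \epsilon_{i+1} - f_i - \epsilon_i)^2\right] \\
&= (f_{i+1} - f_i)^2 + E\left[(\epsilon_{i+1} - \epsilon_i)^2\right] && (11.47) \\
&= (f_{i+1} - f_i)^2 + 2\sigma^2 && (11.48)
\end{aligned}$$

for $i = 1, \ldots, n - 1$. If $f_{i+1} \approx f_i$, then $E[(Y_{i+1} - Y_i)^2] \approx 2\sigma^2$, leading to

$$\hat{\sigma}^2 = \frac{1}{2(n - 1)}\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2. \tag{11.49}$$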

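This estimator will be inflated, as is clear from (11.48). An improved method of moments estimator, proposed by Gasser et al. (1986), is based on weighted second differences of the data. Specifically, first consider the line joining the points $[x_{i-1}, y_{i-1}]$ and $[x_{i+1}, y_{i+1}]$. This line is obtained by solving

$$\begin{aligned}
y_{i-1} &= \alpha_i + \beta_ix_{i-1}, \\
y_{i+1} &= \alpha_i + \beta_ix_{i+1},
\end{aligned}$$

to give

$$\hat{\alpha}_i = \frac{y_{i-1}x_{i+1} - y_{i+1}x_{i-1}}{x_{i+1} - x_{i-1}}, \qquad \hat{\beta}_i = \frac{y_{i+1} - y_{i-1}}{x_{i+1} - x_{i-1}}.$$

Define a pseudo-residual as

$$\begin{aligned}
\tilde{\epsilon}_i &= \hat{\alpha}_i + \hat{\beta}_ix_i - y_i \\
&= a_iy_{i-1} + b_iy_{i+1} - y_i,
\end{aligned}$$

where

$$a_i = \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}}, \qquad b_i = \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}}.$$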

Gasser et al. (1986) show that $E[\tilde{\epsilon}_i^2] = [a_i^2 + b_i^2 + 1]\sigma^2 + O(n^{-2})$ (the final term here is required because the pseudo-residuals do not have mean zero). We are therefore led to the estimator:

$$\hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i=2}^{n-1}c_i^2\tilde{\epsilon}_i^2, \tag{11.50}$$

where $c_i^2 = (a_i^2 + b_i^2 + 1)^{-1}$, for $i = 2, \ldots, n - 1$. Note that the variance estimators (11.49) and (11.50) depend only on $(y_i, x_i)$, $i = 1, \ldots, n$, and not on the model that is fitted.
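Both difference-based estimators are simple to compute once the data are ordered by $x$. The following sketch implements (11.49) and (11.50); the function names and the toy data are illustrative, and the inputs are assumed to be sorted by increasing $x$ with no ties.

```python
import numpy as np

def rice_variance(y):
    """First-difference estimator (11.49); y must be ordered by x."""
    y = np.asarray(y, dtype=float)
    d = np.diff(y)
    return np.sum(d**2) / (2.0 * (len(y) - 1))

def gasser_variance(x, y):
    """Pseudo-residual estimator (11.50); x sorted increasing, no ties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    span = x[2:] - x[:-2]
    a = (x[2:] - x[1:-1]) / span           # a_i for i = 2, ..., n-1
    b = (x[1:-1] - x[:-2]) / span          # b_i for i = 2, ..., n-1
    resid = a * y[:-2] + b * y[2:] - y[1:-1]   # pseudo-residuals
    c2 = 1.0 / (a**2 + b**2 + 1.0)
    return np.sum(c2 * resid**2) / (len(y) - 2)

# Noise-free linear response: (11.50) is zero, while (11.49) is inflated
# by the squared first differences of f, as (11.48) indicates.
x_lin = np.linspace(0.0, 1.0, 11)
y_lin = 1.0 + 2.0 * x_lin
print(rice_variance(y_lin), gasser_variance(x_lin, y_lin))

# Simulated noisy data with true sigma^2 = 0.09
rng = np.random.default_rng(0)
x_sim = np.sort(rng.uniform(0.0, 1.0, 2000))
y_sim = np.sin(2.0 * np.pi * x_sim) + rng.normal(scale=0.3, size=2000)
s2_rice = rice_variance(y_sim)
s2_gasser = gasser_variance(x_sim, y_sim)
```

The noise-free case makes the inflation of the first-difference estimator visible: it returns a positive value even though there is no error variance, whereas the pseudo-residual estimator is exact for a linear mean function.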

Now suppose we believe the data exhibit nonconstant variance (heteroscedasticity). If the variance depends on $f(x)$ via some known form, for example, $\sigma^2(x) = \sigma^2f(x)$, then quasi-likelihood (Sect. 2.5) may be used. Otherwise, consider the model:

$$Y_i = f(x_i) + \sigma(x_i)\epsilon_i,$$

with $E[\epsilon_i] = 0$ and $\text{var}(\epsilon_i) = 1$. Since the variance must be positive, a natural model to consider is

$$\begin{aligned}
Z_i = \log\left[(Y_i - f(x_i))^2\right] &= \log\left[\sigma^2(x_i)\right] + \log(\epsilon_i^2) \\
&= g(x_i) + \delta_i,
\end{aligned} \tag{11.51}$$

where $\delta_i = \log(\epsilon_i^2)$. A simple approach to implementation is to first estimate $f(\cdot)$ under the assumption of constant variance, obtain fitted values, and then form residuals. One may then estimate $g(\cdot)$ using a nonparametric estimator to produce $\hat{\sigma}(x_i)^2 = \exp[\hat{g}(x_i)]$, for $i = 1, \ldots, n$. Subsequently, confidence intervals may be constructed based on $\hat{\sigma}(x)$. For further details, see Yu and Jones (2004). A more statistically rigorous approach would simultaneously estimate $f(\cdot)$ and $g(\cdot)$.


Example:  Light  Detection  and  Ranging 

Using the natural cubic spline fit, the variance estimate based on the residual sum of squares (11.46) is $0.080^2$. The estimates based on the first differences (11.49) and second differences (11.50) are $0.082^2$ and $0.083^2$, respectively. In this example, therefore, the estimates are very similar though of course for these data, the variance of the error terms is clearly nonconstant. In Fig. 11.15(a), we plot the residuals (from a natural cubic spline fit) versus the range. To address the nonconstant error variance, we assume a model of the form (11.51). Figure 11.15(b) plots the log squared residuals $Z_i$, as defined in (11.51), versus the range. Experimentation with smoothing models for $Z_i$ indicates that a simple linear model

$$E[Z_i \mid x_i] = \alpha_0 + \alpha_1x_i$$

is adequate, and this is added to the plot. Figure 11.15(c) plots the estimated standard deviation, $\hat{\sigma}(x) = \sqrt{\exp(\hat{\alpha}_0 + \hat{\alpha}_1x)}$, versus $x$, and Fig. 11.15(d) shows the standardized residuals:

$$\frac{y_i - \hat{f}(x_i)}{\hat{\sigma}(x_i)}$$

versus $x_i$. We see that the spread is constant across the range of $x_i$, suggesting that the error variance model is adequate.


11.5  Spline  and  Kernel  Methods  for  Generalized  Linear  Models

So far we have considered models of the form $Y = f(x) + \epsilon$, with independent and uncorrelated constant variance errors $\epsilon$. We outline the extension to the situation in which generalized linear models (GLMs, Sect. 6.3) are appropriate in a parametric framework. To carry out flexible modeling, penalty terms or weighting may be applied to the log-likelihood and smoothing models (e.g., based on splines or kernels) may be used on the linear predictor scale.

Recall that, for a GLM, $E[Y_i \mid \theta_i, \alpha] = b'(\theta_i) = \mu_i$, with a link function $g(\mu_i)$ and a variance function $\text{var}(Y_i \mid \theta_i, \alpha) = \alpha b''(\theta_i) = \alpha V_i$. In a smoothing context, we may relax the linearity assumption and connect the mean to the smoother via $g(\mu_i) = f(x_i)$. The log-likelihood for a GLM is

$$l(\theta) = \sum_{i=1}^n l_i(\theta) = \sum_{i=1}^n\left[\frac{y_i\theta_i - b(\theta_i)}{\alpha} + c(y_i, \alpha)\right]. \tag{11.52}$$

11.5.1  Generalized  Linear  Models  with  Penalized  Regression 
Splines 

Let $l(f)$ denote the log-likelihood corresponding to the smoother $f(x_i)$, $i = 1, \ldots, n$. Maximizing over all smooth functions $f(\cdot)$ is not useful since there are an infinite number of ways to interpolate the data. Consider a regression spline model on the scale of the canonical link:
Fig. 11.15 Examination of heteroscedasticity for the LIDAR data. In all plots, the range (m) is plotted on the x-axis, and on the y-axis we have (a) residuals from a natural cubic spline fit to the response data; (b) log squared residuals, with a linear fit; (c) the estimated standard deviation $\hat{\sigma}(x)$; and (d) standardized residuals


$$\begin{aligned}
\theta_i = f(x_i) &= \beta_0 + \beta_1x_i + \ldots + \beta_px_i^p + \sum_{l=1}^L b_l(x_i - \xi_l)_+^p \\
&= x_i\beta + z_ib,
\end{aligned}$$

with penalty term $\lambda b^TDb$, where $D$ denotes a known matrix that determines the nature of the penalization, as in Sect. 11.2.5. For example, an obvious form is $\lambda\int f''(t)^2\,dt$. As in Sect. 11.2.8, we may write $f(x_i) = c_i\gamma$ with $c_i = [x_i, z_i]$ and $\gamma = [\beta, b]^T$, and $D = \text{diag}(0_{p+1}, 1_L)$ to give penalty $\lambda\gamma^TD\gamma$. To extend the penalized sum of squares given by (11.21), consider the penalized log-likelihood which adds a penalty to (11.52) to give
$$l_p(\gamma) = l(\gamma) - \lambda\gamma^TD\gamma, \tag{11.53}$$

where

$$l(\gamma) = \sum_{i=1}^n\left[\frac{y_i\theta_i - b(\theta_i)}{\alpha} + c(y_i, \alpha)\right],$$

and $\theta_i = c_i\gamma$.

For known $\lambda$, the parameters $\gamma$ can be estimated as the solution to

$$\frac{\partial l_p}{\partial\gamma_j} = \sum_{i=1}^n\frac{\partial\mu_i}{\partial\gamma_j}\frac{y_i - \mu_i}{\alpha V_i} - 2\lambda[D\gamma]_j = 0. \tag{11.54}$$

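To find a solution, a hybrid of IRLS (as described in Sect. 6.5.2) termed the penalized IRLS (P-IRLS) algorithm can be used. At the $t$th iteration, we minimize a penalized version of (6.16):

$$\left(z^{(t)} - x\gamma\right)^TW^{(t)}\left(z^{(t)} - x\gamma\right) + \lambda\gamma^TD\gamma, \tag{11.55}$$

where, as in the original algorithm, $z^{(t)}$ is the vector of pseudo-data with

$$z_i^{(t)} = x_i\gamma^{(t)} + \left(y_i - \mu_i^{(t)}\right)\left.\frac{d\eta_i}{d\mu_i}\right|_{\gamma^{(t)}}$$

and $W^{(t)}$ is a diagonal matrix with elements:

$$w_i = \frac{\left(\left.d\mu_i/d\eta_i\right|_{\gamma^{(t)}}\right)^2}{\alpha V_i}.$$

The iterative strategy therefore solves (11.55) using the current versions of $z$ and $W$.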

We define an influence matrix for the working penalized least squares problem at the final step of the algorithm as $S^{(\lambda)} = x(x^TWx + \lambda D)^{-1}x^TW$. The effective degrees of freedom is then defined as $p^{(\lambda)} = \text{tr}(S^{(\lambda)})$.

So far as inference is concerned, $\hat{\gamma}$ is asymptotically normal with mean $E[\hat{\gamma}]$ and variance-covariance matrix:

$$\alpha(x^TWx + \lambda D)^{-1}x^TWx(x^TWx + \lambda D)^{-1}.$$

For more details, see Wood (2006, Sect. 4.8).
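To make the iteration concrete, the following is a minimal sketch of P-IRLS for a binary logistic model with a truncated power basis. It follows the working criterion (11.55) as written, with $\alpha = 1$ and the canonical link, so that $w_i = \mu_i(1 - \mu_i)$. The function names, knot placement at sample quantiles, the value $\lambda = 1$, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def spline_design(x, knots, degree=3):
    """Columns 1, x, ..., x^p followed by truncated powers (x - knot)_+^p (degree >= 1)."""
    poly = np.vander(x, degree + 1, increasing=True)
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** degree
    return np.hstack([poly, trunc])

def pirls_logistic(x, y, knots, lam, degree=3, n_iter=50, tol=1e-8):
    """Penalized IRLS for binary y; returns (gamma, fitted probabilities, effective df)."""
    C = spline_design(x, knots, degree)
    # Penalize only the truncated-basis coefficients: D = diag(0_{p+1}, 1_L)
    D = np.diag(np.r_[np.zeros(degree + 1), np.ones(len(knots))])
    gamma = np.zeros(C.shape[1])
    for _ in range(n_iter):
        eta = C @ gamma
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(mu * (1.0 - mu), 1e-10, None)   # working weights
        z = eta + (y - mu) / w                      # pseudo-data
        CtW = C.T * w
        gamma_new = np.linalg.solve(CtW @ C + lam * D, CtW @ z)
        converged = np.max(np.abs(gamma_new - gamma)) < tol
        gamma = gamma_new
        if converged:
            break
    # Influence matrix of the working problem at the final estimate
    eta = C @ gamma
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = np.clip(mu * (1.0 - mu), 1e-10, None)
    CtW = C.T * w
    S = C @ np.linalg.solve(CtW @ C + lam * D, CtW)
    return gamma, mu, np.trace(S)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 400))              # simulated covariate on [0, 1]
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + 4.0 * x)))
y = (rng.uniform(size=400) < p_true).astype(float)   # simulated binary responses
knots = np.quantile(x, np.linspace(0.0, 1.0, 12)[1:-1])   # L = 10 interior knots
gamma_hat, p_hat, eff_df = pirls_logistic(x, y, knots, lam=1.0)
print(eff_df)
```

The effective degrees of freedom lie between $p + 1$ (heavy penalization) and $p + 1 + L$ (no penalization). A criterion such as AIC could be evaluated over a grid of $\lambda$ values by calling `pirls_logistic` repeatedly, though that loop is not shown here.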
Fig. 11.16 Penalized cubic

Example: Bronchopulmonary Dysplasia


We illustrate GLM smoothing using the data introduced in Sect. 7.2.3, which consist of binary responses (BPD) $Y_i$ along with birth weights $x_i$. We consider a logistic regression model:

$$Y_i \mid p(x_i) \sim_{ind} \text{Binomial}[n_i, p(x_i)], \tag{11.56}$$

with

$$\log\left(\frac{p(x_i)}{1 - p(x_i)}\right) = f(x_i). \tag{11.57}$$

The log-likelihood is

$$l(f) = \sum_{i=1}^n\left(y_if(x_i) - n_i\log\{1 + \exp[f(x_i)]\}\right).$$

A penalized spline model assumes

$$\begin{aligned}
f(x_i) &= \beta_0 + \beta_1x_i + \ldots + \beta_px_i^p + \sum_{l=1}^L b_l(x_i - \xi_l)_+^p \\
&= x_i\beta + z_ib.
\end{aligned}$$

The predicted probabilities are therefore

$$\hat{p}(x_i) = \frac{\exp(x_i\hat{\beta} + z_i\hat{b})}{1 + \exp(x_i\hat{\beta} + z_i\hat{b})}.$$

Figure 11.16 displays the data along with three fitted curves. The linear logistic model is symmetric in the tails, which appears overly restrictive for these data. We fit a penalized cubic spline model (11.53) with $L = 10$ knots using P-IRLS and pick the smoothing parameter using AIC. For this model,

$$\text{AIC}(\lambda) = -2l\left(\hat{f}^{(\lambda)}\right) + 2p^{(\lambda)},$$

where we have now explicitly written $\hat{f}^{(\lambda)}$ as a function of the smoothing parameter $\lambda$ and $p^{(\lambda)}$ is the effective degrees of freedom. Figure 11.16 gives the resultant fit, which has an effective degrees of freedom of 3.0. It is difficult to determine the adequacy of the fit with binary data, but in terms of smoothness and monotonicity, the curve appears reasonable. Notice that the behavior for high birthweights is quite different from the linear logistic model.


11.5.2  A  Generalized  Linear  Mixed  Model  Spline  Representation


The regression spline model described in Sect. 11.5.1 has an equivalent specification as a generalized linear mixed model (Sect. 9.3) with the assumption that $b_l \mid \sigma_b^2 \sim_{iid} N(0, \sigma_b^2)$, $l = 1, \ldots, L$. The latter random effects distribution penalizes the truncated basis coefficients.

For a GLM with canonical link, maximization of the penalized log-likelihood (11.53) is then equivalent to maximization of

$$\frac{1}{\alpha}\sum_{i=1}^n\left\{y_i(x_i\beta + z_ib) - b(x_i\beta + z_ib) + \alpha \times c(y_i, \alpha)\right\} - \frac{1}{2\sigma_b^2}b^Tb \tag{11.58}$$

with respect to $\beta$ and $b$, for fixed $\alpha$, $\sigma_b^2$. In practice, estimates of $\alpha$, $\sigma_b^2$ will also be required and will determine the level of smoothing. As discussed in Chap. 9, rather than maximize (11.58) as a function of both $\beta$ and $b$, an alternative is to integrate the random effects $b$ from the model and then maximize the resultant likelihood. This approach is outlined for the case of a binomial model.

The likelihood as a function of $\beta$ and $\sigma_b^2$ is calculated via an $L$-dimensional integral over the random effects $b$:

$$L(\beta, \sigma_b^2) = \int\prod_{i=1}^n\binom{n_i}{y_i}\exp\left\{y_i(x_i\beta + z_ib) - n_i\log[1 + \exp(x_i\beta + z_ib)]\right\} \times (2\pi\sigma_b^2)^{-L/2}\exp\left(-\frac{b^Tb}{2\sigma_b^2}\right)db$$

and may be maximized to find $\hat{\beta}$ and $\hat{\sigma}_b^2$. For implementation, some form of approximate integration strategy must be used; various approaches are described in Chap. 9. The latter also contains details on how the random effects $b_l$ may be estimated (as required to produce the fitted curve), as well as Bayesian approaches to estimation. Under the mixed model formulation, smoothing parameter estimation is carried out via estimation of $\sigma_b^2$. Maximizing jointly for $\beta$ and $b$ is formally equivalent to penalized quasi-likelihood (Breslow and Clayton 1993); see Chaps. 10 and 11 of Ruppert et al. (2003) for the application of penalized quasi-likelihood to spline modeling.

Inference from a likelihood perspective may build on mixed model theory, as described in Chap. 9 (see also, Ruppert et al. 2003, Chap. 11). A Bayesian approach can be implemented using either INLA or MCMC, both of which are described in Chap. 3.


11.5.3  Generalized  Linear  Models  with  Local  Polynomials 

The  extension  of  the  local  polynomial  approach  of  Sect.  11.3.4  to  GLMs  is  rela¬ 
tively  straightforward  with  a  locally  weighted  log-likelihood  replacing  the  locally 
weighted  sum  of  squares  (11.42).  Recall  that  for  the  ith  data  point,  the  canonical 
parameter  is  9i  =  xtf3  (Sect.  6.3).  The  local  polynomial  replaces  the  linear  model 
in  (9j  so  that  we  have  a  local  polynomial  on  the  linear  predictor  scale.  We  write  the 
log-likelihood  for  (3  as 


m  =  YJl\yMP)]- 


To  obtain  the  fit  at  the  point  x  under  a  local  polynomial  model,  we  maximize  the 
locally  weighted  log-likelihood: 


n 


^  ^  (x )  lx  \lji}  Px{Xi7  /3)]  , 


where  Wi(x)  =  K[(xi  —  at)/A]  and  Px(xi7  f3)  is  the  local  polynomial  with 
parameters  (3.  Our  notation  also  emphasizes  that  the  likelihood  is  constructed  for 
each  point  x  at  which  a  prediction  is  desired.  The  local  likelihood  score  equations 
are  therefore 


Once we have performed estimation, the estimate (on the transformed scale) for $x$ is
evaluated as $\hat\beta_0$. This method is often referred to as local likelihood. The existence
and uniqueness of estimates are discussed in Chap. 4 of Loader (1999). For a GLM,
an iterative algorithm is required; Chap. 11 of Loader (1999) gives details based on
the  Newton-Raphson  method.  We  stress  that  the  equations  are  solved  at  all  locations 
x  for  which  we  wish  to  obtain  the  fit.  The  smoothing  parameter  may  again  be  chosen 
in  a  variety  of  ways,  with  cross-validation  being  an  obvious  approach. 
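
The fitting procedure just described can be sketched in code. The following is a minimal illustration, not the implementation of Loader (1999): it assumes a binomial response with $y_i$ events out of $n_i$ trials, a tricube kernel, and a plain Newton-Raphson iteration without step-halving. The function names `tricube` and `local_logistic_fit` are hypothetical. The example data are noise-free expected counts from a model that is linear on the logit scale, chosen only so that the local linear fit can be checked against the known curve.

```python
import numpy as np

def tricube(u):
    """Tricube kernel, zero outside |u| < 1."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)

def local_logistic_fit(x0, x, y, n, lam, degree=1, n_iter=25, tol=1e-10):
    """Local likelihood fit at x0 on the logit scale (the local intercept)."""
    w = tricube((x - x0) / lam)                      # local weights w_i(x0)
    # Local polynomial design matrix in (x_i - x0): columns 1, (x_i - x0), ...
    X = np.vander(x - x0, N=degree + 1, increasing=True)
    beta = np.zeros(degree + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - n * p))              # weighted score vector
        info = X.T @ (X * (w * n * p * (1.0 - p))[:, None])  # weighted information
        step = np.linalg.solve(info, score)          # Newton-Raphson update
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta[0]

# Illustrative data: expected counts under logit(p) = -1 + 2x (no sampling noise).
x = np.linspace(0.0, 1.0, 50)
n = np.full(50, 20.0)
y = n / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
fit = local_logistic_fit(0.5, x, y, n, lam=0.3)
```

Because the polynomial is centered at the target point, the fitted value on the linear predictor scale is simply the estimated intercept. To trace out a curve, the function is called once per target point, which mirrors the remark that the equations are solved at every location of interest.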


Example:  Bronchopulmonary  Dysplasia 


Returning  to  the  BPD/birthweight  example,  local  log-likelihood  fitting  at  the  point 
x  is  based  on 

$$l_x(\beta) = \sum_{i=1}^n w_i(x)\, n_i \left[ \frac{y_i}{n_i} P_x(x_i; \beta) - \log\{1 + \exp[P_x(x_i; \beta)]\} \right] \tag{11.59}$$


with $w_i(x) = K[(x_i - x)/\lambda]$. Writing the likelihood in this form emphasizes that
$w_i(x) n_i$ is acting as a local weight.

Figure 11.16 shows the local linear likelihood fit with a tricube kernel and
smoothing parameter chosen by minimizing the AIC. The latter produces a model
with effective degrees of freedom of 4.1. The local likelihood cubic curve bears
more resemblance to the penalized cubic spline curve than to the linear logistic
model, but there are some differences between the former two approaches, particularly
for birthweights in the 900-1,500-gram range.


11.6  Concluding  Comments 

In  this  chapter  we  have  described  smoothing  methods  for  general  data  types 
based  on  spline  models  and  kernel-based  methods.  A  variety  of  spline  models  are 
available,  but  we  emphasize  that  the  choice  of  smoothing  parameter  will  often  be  far 
more  important  than  the  specific  model  chosen.  For  simple  scatterplot  smoothing, 
the  spline  and  kernel  techniques  of  Sects.  11.2  and  11.3  will  frequently  produce 
very  similar  results.  If  inference  is  required,  penalized  regression  splines  are  a  class 
for  which  the  theory  is  well  developed  and  for  which  much  practical  experience  has 
been  gathered.  To  obtain  confidence  intervals  for  the  complete  curve,  a  Bayesian 
solution  is  recommended;  see  Marra  and  Wood  (2012).  For  inference  about  a  curve, 
including  confidence  bands,  care  must  be  taken  in  variance  estimation,  as  described 
in Sect. 11.4. In terms of smoothing parameter choice, there will often be no clear
optimal  choice,  and  a  visual  examination  of  the  resultant  fit  is  always  recommended. 

Kernel-based methods are very convenient analytically, and we have seen that
expressions for the bias and variance are available in closed form, which allows
insight into when they might perform well. Spline models are not so conducive to
such analysis, though penalized regression splines have the great advantage of having
a mixed model representation, which allows the incorporation of random effects and
the estimation of smoothing parameters using conventional estimation techniques.


11.7  Bibliographic  Notes 

Book-length treatments of spline methods include Wahba (1990) and Gu (2002).
A  key  early  reference  on  spline  smoothing  is  Reinsch  (1967).  The  book  of  Wand 
and  Jones  (1995)  is  an  excellent  introduction  to  kernel  methods.  Local  polynomial 
methods  are  described  in  detail  in  Fan  and  Gijbels  (1996)  and  Loader  (1999). 
Bowman  and  Azzalini  (1997)  provides  a  more  applied  slant.  The  work  of  Ruppert 
et  al.  (2003)  is  a  readable  account  of  smoothing  methods,  with  an  emphasis  on  the 
mixed  model  representation  of  penalized  regression  splines. 


11.8 Exercises

11.1 Based on (11.7) and (11.8), write code, for example, within R, to produce plots
of the B-spline basis functions of order $M = 1, 2, 3, 4$, with $L = 9$ knots and
for $x \in [0, 1]$.

11.2 Prove that (11.22) and (11.23) are equivalent to (11.24).

11.3 Show that an alternative basis for the natural cubic spline given by (11.3), with
constraints (11.4) and (11.5), is

$$h_1(x) = 1, \quad h_2(x) = x, \quad h_{l+2}(x) = d_l(x) - d_{L-1}(x),$$

where

$$d_l(x) = \frac{(x - \xi_l)_+^3 - (x - \xi_L)_+^3}{\xi_L - \xi_l}.$$


11.4 In this question, various models will be fit to the fossil data of Chaudhuri and
Marron (1999). These data consist of 106 measurements of ratios of strontium
isotopes  found  in  fossil  shells  and  their  age.  These  data  are  available  in  the  R 
package  SemiPar  and  are  named  fossil.  Fit  the  following  models  to  these 
data: 

(a)  A  natural  cubic  spline  (this  model  has  n  knots),  using  ordinary  cross- 
validation  to  select  the  smoothing  parameter. 

(b)  A  natural  cubic  spline  (this  model  has  n  knots),  using  generalized  cross- 
validation  to  select  the  smoothing  parameter. 

(c)  A  penalized  cubic  regression  spline  with  L  =  20  equally  spaced  knots, 
using  ordinary  cross-validation  to  select  the  smoothing  parameter. 

(d)  A  penalized  cubic  regression  spline  with  L  =  20  equally  spaced  knots, 
using  generalized  cross-validation  to  select  the  smoothing  parameter. 

(e)  A  penalized  cubic  regression  spline  with  L  =  20  equally  spaced  knots, 
using  a  mixed  model  representation  to  select  the  smoothing  parameter. 

In each case report $\hat f(x)$, along with an asymptotic 95% confidence interval
for the (smoothed) function, at $x = 95$ and $x = 115$ years.

11.5  In  this  question  a  dataset  that  concerns  cosmic  microwave  background  (CMB) 
will  be  analyzed.  These  data  are  available  at  the  book  website;  the  first  column 
is  the  wave  number  (the  x  variable),  while  the  second  column  is  the  estimated 
spectrum  (the  y  variable): 

(a)  Fit  a  penalized  cubic  regression  spline  using,  for  example,  the  R  package 
mgcv. 

(b) Fit a Nadaraya-Watson locally constant estimator.

(c)  Fit  a  locally  linear  polynomial  model. 

(d)  Which  of  the  three  models  appears  to  give  the  best  fit  to  these  data? 

(e)  Obtain  residuals  from  the  fit  in  part  (c)  and  form  the  log  of  the  squared 
residuals.  Model  the  latter  as  a  function  of  x. 

(f) Compare the model for the fitted standard deviation with the estimated
standard  error  (which  is  the  third  column  of  the  data). 

(g)  Reestimate  the  linear  polynomial  model,  weighting  the  observations  by  the 
reciprocal  of  the  variance,  where  the  latter  is  the  square  of  the  estimated 
standard  errors  (column  three  of  the  data).  Repeat  using  your  estimated 
variance  function. 

(h)  Does  the  fit  appear  improved  when  compared  with  constant  weighting? 

At  each  stage  provide  a  careful  description  of  how  the  models  were  fitted. 
For  example,  in  (a),  how  were  the  knots  chosen,  and  in  (b)  and  (c),  what 
kernels  and  smoothing  parameters  were  used  and  why? 

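11.6 For the locally linear polynomial fit described in Sect. 11.3.4, show that

$$\hat f(x) = \frac{\sum_{i=1}^n w_i(x) Y_i}{\sum_{i=1}^n w_i(x)} + (x - \bar x_w) \frac{\sum_{i=1}^n w_i(x)(x_i - \bar x_w) Y_i}{\sum_{i=1}^n w_i(x)(x_i - \bar x_w)^2},$$

where $\bar x_w = \sum_{i=1}^n w_i(x) x_i / \sum_{i=1}^n w_i(x)$ and $w_i(x) = K[(x - x_i)/\lambda]$ is a
kernel.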


Chapter  12 

Nonparametric Regression with Multiple
Predictors


12.1  Introduction 

In this chapter we describe how the methods of Chaps. 10 and 11 may be
extended to the situation in which there are multiple predictors. We also provide a
description of methods for classification, concentrating on approaches that are
model based rather than algorithm based.

To motivate the ensuing description of modeling with multiple covariates,
suppose that $x_{i1}, \ldots, x_{ik}$ are $k$ covariates measured on individual $i$, with $Y_i$ a
univariate response. In Chap. 6 generalized linear models (GLMs) were considered
in detail, and we begin this chapter by relaxing the linearity assumption via so-called
generalized additive models. A GLM has $Y_i$ independently distributed from
an exponential family with $E[Y_i \mid x_i] = \mu_i$. A link function $g(\mu_i)$ then connects the
mean to a linear predictor

$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}. \tag{12.1}$$

This  model  is  readily  interpreted  but  has  two  serious  restrictions.  First,  we  are 
constrained  to  linearity  on  the  link  function  scale.  Transformations  of  x  values  or 
inclusion  of  polynomial  terms  may  relax  this  assumption  somewhat,  but  we  may 
desire  a  more  flexible  form.  Second,  we  are  only  modeling  each  covariate  separately. 
We  can  add  interactions  but  may  prefer  an  automatic  method  for  seeing  the  way  in 
which  the  response  is  associated  with  two  or  more  variables. 

A general specification with $k$ covariates is

$$g(\mu_i) = f(x_{i1}, x_{i2}, \ldots, x_{ik}). \tag{12.2}$$

Flexible modeling of the complete $k$-dimensional surface is extremely difficult to
achieve due to the curse of dimensionality. To capture “local” behavior in high
dimensions requires a large number of data points. To illustrate, suppose we wish
to smooth a function at a point using covariates within a $k$-dimensional hypercube
centered at that point, and suppose also that the covariates are uniformly distributed
in the $k$-dimensional unit hypercube. To capture a proportion $q$ of the unit volume
requires the expected edge length to be $q^{1/k}$. For example, to capture 1% of the
points in $k = 4$ dimensions requires $0.01^{1/4} = 0.32$ of the unit length of each
variable  to  be  covered.  In  other  words,  “local”  has  to  extend  a  long  way  in  higher 
dimensions,  and  so  modeling  a  response  as  a  function  of  multiple  covariates  using 
local  smoothing  becomes  increasingly  more  difficult  as  the  number  of  covariates 
grows. 
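
The edge-length calculation is easy to tabulate. The short sketch below evaluates $q^{1/k}$ for a fixed proportion $q = 0.01$ and several dimensions; the function name `edge_length` is introduced here purely for illustration.

```python
def edge_length(q, k):
    """Edge length of a k-dimensional cube that captures a proportion q of the unit cube."""
    return q ** (1.0 / k)

# Required edge length for capturing 1% of uniformly distributed points.
for k in (1, 2, 4, 10):
    print(k, round(edge_length(0.01, k), 2))
```

For $k = 4$ this gives the value of roughly 0.32 quoted above, and the required edge length continues to approach one as $k$ grows.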

The  outline  of  this  chapter  is  as  follows.  The  modeling  of  multiple  predictors 
via  the  popular  class  of  generalized  additive  models  is  the  subject  of  Sect.  12.2. 
Section 12.3 extends the spline models of Sect. 11.2 to the multiple covariate case,
including  descriptions  of  natural  thin  plate  splines,  thin  plate  regression  splines, 
and  tensor  product  splines.  The  kernel  methods  of  Sect.  11.3  are  described  for 
multiple  covariates  in  Sect.  12.4.  Section  12.5  considers  approaches  to  smoothing 
parameter  estimation  including  the  use  of  a  mixed  model  formulation.  Varying- 
coefficient  models  provide  one  approach  to  modeling  interactions,  and  these  are 
outlined  in  Sect.  12.6.  Moving  towards  classification,  regression  tree  methods  are 
discussed  in  Sect.  12.7.  Section  12.8  is  dedicated  to  a  brief  description  of  methods 
for  classification,  including  logistic  modeling,  linear  and  quadratic  discriminant 
analysis,  kernel  density  estimation,  classification  trees,  bagging,  and  random  forests. 
Concluding  comments  appear  in  Sect.  12.9.  Section  12.10  gives  references  to 
additional  approaches  and  more  detailed  descriptions  of  the  approaches  considered 
here. 


12.2  Generalized  Additive  Models 
12.2.1  Model  Formulation 

Generalized additive models (GAMs) are an extremely popular, simple, and interpretable
extension of GLMs (which were described in Sect. 6.3). The simplest GAM
extends the linear predictor (12.1) of the GLM to the additive form

$$g(\mu_i) = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_k(x_{ik}) \tag{12.3}$$

where $\beta_0$ is the intercept and $f_j(\cdot)$, $j = 1, \ldots, k$, are a set of smooth functions.
Each of the functions $f_j(\cdot)$ may be modeled using different techniques, with splines
and kernel local polynomials (as described in Chap. 11) being obvious choices.
For reasons of identifiability, we impose $\sum_{i=1}^n f_j(x_{ij}) = 0$, for $j = 1, \ldots, k$.

A  GAM  may  also  consist  of  smooth  terms  that  are  functions  of  pairs,  or  triples  of 
variables,  providing  a  compromise  between  the  simplest  model  with  k  smoothers 
and  the  “full”  model  (12.2)  which  allows  interactions  between  all  variables.  The 
multivariate  spline  models  of  Sect.  12.3  provide  one  approach  to  the  modeling 
of  more  than  a  single  variable.  As  a  concrete  example  of  a  GAM,  suppose  that 
univariate penalized regression splines (Sect. 11.5.1) are used for each of the
covariates, with the spline for covariate $j$ being of degree $p_j$ and with knot locations
$\xi_{jl}$, $l = 1, \ldots, L_j$. The GAM is

$$g(\mu_i) = \beta_0 + \sum_{j=1}^k \left[ \sum_{d=1}^{p_j} \beta_{jd} x_{ij}^d + \sum_{l=1}^{L_j} b_{jl} (x_{ij} - \xi_{jl})_+^{p_j} \right]$$

with penalization applied to the coefficients $b_j = [b_{j1}, \ldots, b_{jL_j}]^T$, as described in
Sect. 11.2.5. For example, penalty $j$ may be of the form

$$\lambda_j \sum_{l=1}^{L_j} b_{jl}^2.$$

Model (12.3) is very simple to interpret since the smoother for element $j$ of $x$
is the same regardless of the values of the other elements. Hence, each of
the $f_j$ terms may be plotted to visually examine the relationship between $Y$ and $x_j$;
Fig. 12.1 provides an example. Model (12.3) is also computationally convenient, as
we  shall  see  in  Sect.  12.2.2. 

A semiparametric model is one in which a subset of the covariates are modeled
parametrically, with the remainder modeled nonparametrically. Specifically, let
$z_i = [z_{i1}, \ldots, z_{iq}]$ represent the set of variables we wish to model parametrically
and $\beta = [\beta_1, \ldots, \beta_q]^T$ the set of associated regression coefficients. Then (12.3) is
simply extended to

$$g(\mu_i) = \beta_0 + z_i \beta + \sum_{j=1}^k f_j(x_{ij}).$$

We saw an example of this form in Sect. 11.2.9 in which spinal bone marrow density
was modeled as a parametric function of ethnicity and as a nonparametric function
of age.


12.2.2  Computation  via  Backfitting 

The structure of an additive model suggests a simple and intuitive fitting algorithm.
Consider first the linear link $g(\mu_i) = \mu_i$, to give the additive model

$$Y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_k(x_{ik}) + \epsilon_i. \tag{12.4}$$


Define partial residuals

$$r_i^{(j)} = Y_i - \beta_0 - \sum_{l \neq j} f_l(x_{il}) \tag{12.5}$$

for $j = 1, \ldots, k$. For these residuals,

$$E[r_i^{(j)} \mid x_{ij}] = f_j(x_{ij}),$$

which suggests we can estimate $f_j$, using as response the residuals $r_1^{(j)}, \ldots, r_n^{(j)}$.
Iterating across $j$ produces the backfitting algorithm. Backfitting proceeds as
follows:

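1. Initialize: $\hat\beta_0 = \frac{1}{n} \sum_{i=1}^n y_i$ and $\hat f_j \equiv 0$ for $j = 1, \ldots, k$.

2. For a generic smoother $S_j$, cycle over $j$ repeatedly:

$$\hat f_j = S_j(r_1^{(j)}, \ldots, r_n^{(j)}),$$

with $r_i^{(j)}$ given by (12.5), until the functions $\hat f_j$ change by less than some
prespecified threshold.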
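
The two steps above can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the algorithm of any particular package: the generic smoother $S_j$ is taken to be a Nadaraya-Watson smoother with a Gaussian kernel and a common bandwidth, each fitted function is re-centered after smoothing to respect the sum-to-zero constraint, and the names `kernel_smoother` and `backfit` and the simulated data are all hypothetical.

```python
import numpy as np

def kernel_smoother(x, r, lam):
    """Nadaraya-Watson smoother (Gaussian kernel) of r on x, evaluated at the observed x."""
    d = (x[:, None] - x[None, :]) / lam
    w = np.exp(-0.5 * d**2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, lam=0.1, max_iter=100, tol=1e-8):
    """Backfitting for the additive model (12.4); returns the intercept and fitted f_j(x_ij)."""
    n, k = X.shape
    beta0 = y.mean()                      # step 1: initialize intercept
    f = np.zeros((n, k))                  # step 1: all smooths start at zero
    for _ in range(max_iter):
        f_old = f.copy()
        for j in range(k):                # step 2: cycle over the covariates
            # Partial residuals (12.5): remove intercept and all other smooths.
            r = y - beta0 - f.sum(axis=1) + f[:, j]
            f[:, j] = kernel_smoother(X[:, j], r, lam)
            f[:, j] -= f[:, j].mean()     # identifiability: fitted values sum to zero
        if np.max(np.abs(f - f_old)) < tol:
            break
    return beta0, f

# Simulated example with two covariates (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 1.0 + np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + rng.normal(0, 0.1, 200)
beta0, f = backfit(X, y)
resid = y - beta0 - f.sum(axis=1)
```

Any other univariate smoother, such as a penalized regression spline, could be substituted for `kernel_smoother` without changing the outer loop, which is the sense in which the smoother is generic.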

Buja  et  al.  (1989)  describe  the  convergence  properties  of  backfitting.  For  general 
responses  beyond  (12.4),  the  backfitting  algorithm  uses  the  “working”  residuals,  as 
defined  with  respect  to  the  IRLS  algorithm  in  Sect.  6.5.2.  Wood  (2006)  contains 
details  on  how  the  P-IRLS  algorithm  (Sect.  11.5.1)  may  be  extended  to  fit  GAMs. 
An  alternative  method  of  computation  for  GAMs,  based  on  a  mixed  model 
representation,  is  described  in  Sect.  12.5.2. 


Example:  Prostate  Cancer 

For  illustration,  we  fit  a  GAM  to  the  prostate  cancer  data  (Sect.  1.3.1)  in  order  to 
evaluate  whether  a  parametric  model  is  adequate.  The  response  is  log  PSA,  and 
we  model  each  of  log  cancer  volume,  log  weight,  log  age,  log  BPH,  log  capsular 
penetration,  and  PGS45  using  smooth  functions.  The  variable  SVI  is  binary,  and 
the  Gleason  score  can  take  just  4  values.  Hence,  for  these  two  variables,  we 
assume  a  parametric  linear  model.  The  smooth  functions  are  modeled  as  penalized 
regression  cubic  splines,  with  seven  knots  for  each  of  the  six  variables.  Generalized 
cross-validation  (GCV,  Sect.  10.6.3)  was  used  for  smoothing  parameter  estimation 
and  produced  effective  degrees  of  freedom  of  1,  1.1,  1.5,  1,  4.6,  and  3.9  for  the 
six  smooth  terms  (with  the  variable  order  being  as  in  Fig.  12.1).  The  resultant 
fitted  smooths,  with  shaded  bands  indicating  pointwise  asymptotic  95%  confidence 
intervals,  are  plotted  in  Fig.  12.1.  Panels  (e)  and  (f)  indicate  some  nonlinearity,  but 
the  wide  uncertainty  bands  and  flatness  of  the  curves  indicate  that  little  will  be 
lost  if  linear  terms  are  assumed  for  all  variables.  This  figure  illustrates  the  simple 
interpretation  afforded  by  GAMs,  since  each  smooth  shows  the  modeled  association 
with  that  variable,  with  all  other  variables  held  constant. 

Fig.  12.1  GAM  fits  to  the  prostate  cancer  data.  For  each  covariate,  penalized  cubic  regression 
splines  were  fitted,  with  seven  knots  each.  The  tick  marks  on  the  x  axis  indicate  the  covariate 
values 

In this section we describe how splines may be defined as a function of
multivariate $x$. These models are of interest in their own right and may be used
within GAM formulations alongside univariate specifications. For example, suppose
associated with a response $Y$ there are three variables: temperature $x_1$, latitude $x_2$,
and longitude $x_3$. In this situation we might specify a GAM with two smoothers,
$f_1(x_1)$ for temperature and $f_2(x_2, x_3)$, a bivariate smoother for $x_2, x_3$ (since we
might expect an interaction involving latitude and longitude).
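

12.3.1 Natural Thin Plate Splines

For simplicity, we concentrate on the two-dimensional case and begin by defining a
measure of the smoothness of a function $f(x_1, x_2)$. In the one-dimensional case, the
penalty was $P(f) = \int f''(x)^2 \, dx$. A natural penalty term to measure rapid variation
in $f$ in two dimensions is

$$P(f) = \iint \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2. \tag{12.6}$$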

Changing the coordinates by rotation or translation in $\mathbb{R}^2$ does not affect the value of
the penalty,¹ which is an appealing property. The penalty is always nonnegative, and,
as in the one-dimensional case, the penalty equals zero if and only if $f(x)$ is linear
in $x_1$ and $x_2$, as we now show. If $f(x)$ is linear, then it is clear that $P(f)$ is zero.
Conversely, if $P(f) = 0$, all of the second derivatives are zero. Now, $\partial^2 f / \partial x_1^2 = 0$
implies $f(x_1, x_2) = a(x_2) x_1 + b(x_2)$ for functions $a(\cdot)$ and $b(\cdot)$. The condition
$\partial^2 f / \partial x_1 \partial x_2 = 0$ gives $a'(x_2) = 0$ so that $a(x_2) = a$ for some constant $a$. Finally,
$\partial^2 f / \partial x_2^2 = 0$ implies $b''(x_2) = 0$ so that $b'(x_2) = b$ and $b(x_2) = b x_2 + c$, for
constants $b$ and $c$. It follows that

$$f(x_1, x_2) = a x_1 + b x_2 + c$$

is linear.

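We wish to minimize the penalized sum of squares

$$\sum_{i=1}^n [y_i - f(x_i)]^2 + \lambda P(f) \tag{12.7}$$

with penalization term (12.6). As shown by Green and Silverman (1994, Chap. 7),
the unique minimizer is provided by the natural thin plate spline with knots at the
observed data, which is defined as

$$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \sum_{i=1}^n b_i \, \eta(\|x - x_i\|) \tag{12.8}$$

where

$$\eta(r) = \begin{cases} \dfrac{1}{8\pi} r^2 \log(r) & \text{for } r > 0 \\ 0 & \text{for } r = 0 \end{cases}$$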

¹This requirement is natural in a spatial context where the coordinate directions and the position
of  origin  are  arbitrary. 

and the unknown $b_i$ are constrained via

$$\sum_{i=1}^n b_i = \sum_{i=1}^n b_i x_{i1} = \sum_{i=1}^n b_i x_{i2} = 0,$$

that is, $x^T b = 0$, where $x = [x_1, \ldots, x_n]^T$ is $n \times 3$ with $x_i = [1, x_{i1}, x_{i2}]$ and
$b = [b_1, \ldots, b_n]^T$.


Such a spline provides the unique minimizer of $P(f)$ among interpolating
functions. Interested readers are referred to Theorems 7.2 and 7.3 of Green and
Silverman (1994) and to Duchon (1977), who proved optimality and uniqueness
properties for natural thin plate splines. Consequently, the one-dimensional result
outlined in Sect. 11.2.3 holds in two dimensions also. If $f$ is a natural thin plate
spline, it can be shown that the penalty (12.6) is given by $P(f) = b^T E b$ where $E$ is
the $n \times n$ matrix with $E_{ij} = \eta(\|x_i - x_j\|)$, $i, j = 1, \ldots, n$ (Green and Silverman
1994, Theorem 7.1). The minimization of (12.7) with penalty term (12.6) is therefore that of

$$(y - x\beta - Eb)^T (y - x\beta - Eb) + \lambda b^T E b \tag{12.9}$$

subject again to $x^T b = 0$ and where $\beta = [\beta_0, \beta_1, \beta_2]^T$. Green and Silverman (1994,
p. 148) show that this system of equations has a unique solution.
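
One way to carry out this constrained minimization numerically is sketched below. It relies on the observation that any pair $(\beta, b)$ satisfying $(E + \lambda I) b + x\beta = y$ together with $x^T b = 0$ is a stationary point of (12.9) under the constraint; solving that square linear system directly is the approach assumed here, and it is not claimed to be the most efficient or most stable scheme. The function names `eta` and `fit_thin_plate` are hypothetical.

```python
import numpy as np

def eta(r):
    """Thin plate radial basis r^2 log(r) / (8 pi), with eta(0) = 0."""
    out = np.zeros_like(r, dtype=float)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos]) / (8.0 * np.pi)
    return out

def fit_thin_plate(X2, y, lam):
    """Natural thin plate spline in two dimensions. X2 is n x 2; returns (beta, b)."""
    n = X2.shape[0]
    # E_ij = eta(||x_i - x_j||), built from all pairwise distances.
    E = eta(np.linalg.norm(X2[:, None, :] - X2[None, :, :], axis=2))
    T = np.column_stack([np.ones(n), X2])        # the n x 3 matrix called x in the text
    # Block system: (E + lam I) b + T beta = y and T^T b = 0.
    A = np.block([[E + lam * np.eye(n), T],
                  [T.T, np.zeros((3, 3))]])
    rhs = np.concatenate([y, np.zeros(3)])
    sol = np.linalg.solve(A, rhs)
    return sol[n:], sol[:n]

def predict_thin_plate(Xnew, X2, beta, b):
    """Evaluate (12.8) at the rows of Xnew."""
    R = np.linalg.norm(Xnew[:, None, :] - X2[None, :, :], axis=2)
    return beta[0] + Xnew @ beta[1:] + eta(R) @ b
```

As a check on the construction, data lying exactly on a plane should return that plane with all $b_i$ equal to zero, since a linear function incurs no penalty.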

In  terms  of  a  mechanical  interpretation,  suppose  that  an  infinite  elastic  flat  plate 
interpolates a set of points $[x_i, y_i]$, $i = 1, \ldots, n$. Then the “bending energy” of the
plate  is  proportional  to  the  penalty  term  (12.6),  and  the  minimum  energy  solution 
is  the  natural  thin  plate  spline.  Natural  thin  plate  regression  splines  can  be  easily 
generalized  to  dimensions  greater  than  two.  Green  and  Silverman  (1994,  Sect.  7.9) 
contains  details. 


12.3.2  Thin  Plate  Regression  Splines 

Natural thin plate splines are very appealing since they remove the need to decide
upon knot locations or basis functions; each is contained in (12.8). In practice,
however, thin plate splines have too many parameters. A thin plate regression
spline (TPRS) reduces the dimension of the space of the “wiggly” basis (the $b_i$'s
in (12.8)), while leaving $\beta$ unchanged. Specifically, let $E = UDU^T$ be the
eigen-decomposition of $E$, so that $D$ is a diagonal matrix containing the eigenvalues of
$E$ arranged so that $|D_{i-1,i-1}| \geq |D_{i,i}|$, $i = 2, \ldots, n$, and the columns of $U$ are
the corresponding eigenvectors. Now, let $U_k$ denote the matrix containing the first
$k$ columns of $U$ and $D_k$ the top left $k \times k$ submatrix of $D$. Finally, write $b = U_k b_k$
so that $b$ is restricted to the column space of $U_k$. Then, under this reduced basis
formulation, analogous to (12.9), we minimize with respect to $\beta$ and $b_k$

$$(y - x\beta - U_k D_k b_k)^T (y - x\beta - U_k D_k b_k) + \lambda b_k^T D_k b_k$$

subject to $x^T U_k b_k = 0$. See Wood (2006, Sect. 4.1.5) for further details, including
the  manner  by  which  predictions  are  obtained  and  details  on  implementation.  In 
addition,  the  optimality  of  thin  plate  regression  splines  as  approximating  thin  plate 
splines  using  a  basis  of  low  rank  is  discussed.  Thin  plate  regression  splines  retain 
both  the  advantage  of  avoiding  the  choice  of  knot  locations  and  the  rotational 
invariance  of  thin  plate  splines. 
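
The rank reduction can be illustrated with a direct, unoptimized sketch. It forms the full eigen-decomposition of $E$, keeps the $k$ eigenvalues of largest absolute value, and absorbs the constraint $x^T U_k b_k = 0$ by reparameterizing $b_k$ in a basis for the null space of the constraint, obtained here from a QR decomposition. These are choices made for this illustration; Wood (2006) should be consulted for how practical implementations avoid the full decomposition. The names `eta` and `fit_tprs` are hypothetical.

```python
import numpy as np

def eta(r):
    """Thin plate radial basis r^2 log(r) / (8 pi), with eta(0) = 0."""
    out = np.zeros_like(r, dtype=float)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos]) / (8.0 * np.pi)
    return out

def fit_tprs(X2, y, lam, k):
    """Rank-k thin plate regression spline. X2 is n x 2; returns (beta, b) with b = U_k b_k."""
    n = X2.shape[0]
    E = eta(np.linalg.norm(X2[:, None, :] - X2[None, :, :], axis=2))
    T = np.column_stack([np.ones(n), X2])        # the n x 3 matrix called x in the text
    vals, vecs = np.linalg.eigh(E)               # E is symmetric
    order = np.argsort(-np.abs(vals))[:k]        # k largest eigenvalues in absolute value
    Uk, Dk = vecs[:, order], np.diag(vals[order])
    # Constraint T^T U_k b_k = 0: write b_k = Zc @ gamma, with the columns of Zc
    # spanning the null space of (U_k^T T)^T.
    Q, _ = np.linalg.qr(Uk.T @ T, mode="complete")
    Zc = Q[:, 3:]
    C = np.column_stack([T, Uk @ Dk @ Zc])       # design matrix for (beta, gamma)
    S = np.zeros((C.shape[1], C.shape[1]))
    S[3:, 3:] = Zc.T @ Dk @ Zc                   # penalty b_k^T D_k b_k in terms of gamma
    theta = np.linalg.solve(C.T @ C + lam * S, C.T @ y)
    return theta[:3], Uk @ Zc @ theta[3:]
```

The returned vector `b` lies in the column space of $U_k$ and satisfies the constraint by construction, so fitted values at the observed points are `T @ beta + E @ b` exactly as for the full-rank spline.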


Example:  Prostate  Cancer 

For  illustration,  we  examine  the  association  between  the  log  of  PSA  and  log 
cancer  volume  and  log  weight.  Figure  12.2(a)  shows  the  two-dimensional  surface 
corresponding  to  a  model  that  is  linear  in  the  two  covariates  (and  in  particular  has 
no  interaction  term).  We  next  fit  a  GAM  with  a  TPRS  smoother  for  log  cancer 
volume  and  log  weight,  along  with  (univariate)  cubic  regression  splines  for  age, 
log BPH, log capsular penetration, and PGS45, along with linear terms for SVI
and  the  Gleason  score.  Figure  12.2(b)  provides  a  perspective  plot  of  the  fitted 
bivariate  surface.  There  are  some  differences  between  this  plot  and  the  linear  model. 
In  particular  for  high  values  of  log  cancer  volume  and  low  values  of  log  weight, 
the  linear  no  interaction  model  gives  a  lower  prediction  than  the  TPRS  smoother. 
Overall,  however,  there  is  no  strong  evidence  of  an  interaction. 


12.3.3  Tensor  Product  Splines 

As an alternative to thin plate splines, one may consider products of basis functions.
Again, suppose that $x \in \mathbb{R}^2$ and that we have basis functions $h_{jl}(x_j)$, $l = 1, \ldots, M_j$,
representing $x_j$, $j = 1, 2$. Then, the $M_1 \times M_2$ dimensional tensor product
basis is defined by

$$g_{j_1 j_2}(x) = h_{1 j_1}(x_1) h_{2 j_2}(x_2), \quad j_1 = 1, \ldots, M_1; \; j_2 = 1, \ldots, M_2,$$

which leads to the two-dimensional predictive function:

$$f(x) = \sum_{j_1=1}^{M_1} \sum_{j_2=1}^{M_2} \beta_{j_1 j_2} \, g_{j_1 j_2}(x).$$


We illustrate this construction using spline bases. Suppose that we wish to specify
linear splines with $L_1$ truncated lines for $x_1$ and $L_2$ for $x_2$. This model therefore has
$L_1 + 2$ and $L_2 + 2$ bases in the two dimensions:

$$1, \; x_1, \; (x_1 - \xi_{11})_+, \ldots, (x_1 - \xi_{1L_1})_+, \tag{12.10}$$

$$1, \; x_2, \; (x_2 - \xi_{21})_+, \ldots, (x_2 - \xi_{2L_2})_+. \tag{12.11}$$

Fig. 12.2 Perspective plots of the fitted surfaces for the variables log cancer volume
and log weight in the prostate cancer data: (a) linear model, (b) thin plate regression
spline model, (c) tensor product spline model

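The tensor product model is

$$\begin{aligned}
f(x_1, x_2) ={} & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
& + \sum_{l_1=1}^{L_1} b_{l_1}^{(1)} (x_1 - \xi_{1 l_1})_+ + \sum_{l_2=1}^{L_2} b_{l_2}^{(2)} (x_2 - \xi_{2 l_2})_+ \\
& + \sum_{l_1=1}^{L_1} c_{l_1}^{(1)} x_2 (x_1 - \xi_{1 l_1})_+ + \sum_{l_2=1}^{L_2} c_{l_2}^{(2)} x_1 (x_2 - \xi_{2 l_2})_+ \\
& + \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} d_{l_1 l_2} (x_1 - \xi_{1 l_1})_+ (x_2 - \xi_{2 l_2})_+.
\end{aligned} \tag{12.12}$$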

An  additive  model  would  correspond  to  the  first  two  lines  of  this  model  only  (with 
the  X\X2  term  removed),  illustrating  that  the  last  two  lines  are  modeling  interactions. 
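The unknown parameters associated with this model are

$$\begin{aligned}
\beta &= [\beta_0, \ldots, \beta_3]^T, \\
b^{(1)} &= [b_1^{(1)}, \ldots, b_{L_1}^{(1)}]^T, \quad b^{(2)} = [b_1^{(2)}, \ldots, b_{L_2}^{(2)}]^T, \\
c^{(1)} &= [c_1^{(1)}, \ldots, c_{L_1}^{(1)}]^T, \quad c^{(2)} = [c_1^{(2)}, \ldots, c_{L_2}^{(2)}]^T, \\
d &= [d_{11}, \ldots, d_{L_1 L_2}]^T.
\end{aligned}$$

Consequently, there are

$$4 + 2L_1 + 2L_2 + L_1 L_2 = (L_1 + 2)(L_2 + 2)$$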

parameters  in  the  tensor  product  model.  Clearly  the  dimensionality  of  the  basis 
increases  quickly  with  the  dimensionality  of  the  covariate  space  k.  See  Exercise  12.1 
for  an  example  of  the  construction  and  display  of  these  bases.  An  example  of  a  tensor 
product  fit  is  given  at  the  end  of  Section  12.5. 
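
A small sketch makes the construction and the parameter count concrete. It builds the marginal linear spline bases (12.10) and (12.11) and forms every pairwise product; the helper names `linear_spline_basis` and `tensor_product_basis`, and the column ordering of the result, are choices made for this illustration only.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns 1, x, (x - xi_1)_+, ..., (x - xi_L)_+ for a single covariate."""
    trunc = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, trunc])

def tensor_product_basis(x1, x2, knots1, knots2):
    """All pairwise products of the two marginal bases, one row per observation."""
    H1 = linear_spline_basis(x1, knots1)          # n x (L1 + 2)
    H2 = linear_spline_basis(x2, knots2)          # n x (L2 + 2)
    # Row i holds h_{1 j1}(x_i1) * h_{2 j2}(x_i2) for every pair (j1, j2).
    return np.einsum("ij,ik->ijk", H1, H2).reshape(len(x1), -1)

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, 50), rng.uniform(0, 1, 50)
knots1, knots2 = [0.25, 0.5, 0.75], [0.33, 0.66]   # L1 = 3, L2 = 2 (arbitrary choices)
G = tensor_product_basis(x1, x2, knots1, knots2)
```

With these knots the basis has $(3 + 2)(2 + 2) = 20$ columns, and the columns reproduce the terms of (12.12): a constant, the two main effects, $x_1 x_2$, the truncated lines, their products with the other covariate, and the products of truncated lines.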

The fit from a tensor product basis is not invariant to the orientation of the
coordinate axes. Radial invariance can be achieved with basis functions of the form
$C(\|x - \xi\|)$, with $\xi = [\xi_1, \xi_2]$ and for some univariate function $C(\cdot)$; see, for
example, Ruppert et al. (2003). The value of the function at $x$ only depends on the
distance from this point to $\xi$, and so the function is radially symmetric about this
point.


12.4 Kernel Methods with Multiple Predictors

In  principle,  the  extension  of  the  kernel  local  polynomial  smoothing  methods  of 
Sect.  11.3  is  straightforward;  one  simply  needs  to  choose  a  multivariate  weight 
function  (i.e.,  a  kernel)  and  a  multivariate  local  polynomial.  For  simplicity,  we 
consider  the  case  of  two  covariates  and  a  continuous  response  with  additive  errors: 


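$$Y_i = f(x_{i1}, x_{i2}) + \epsilon_i$$

with $E[\epsilon_i] = 0$, $\text{var}(\epsilon_i) = \sigma^2$ and $\text{cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$. A suitably smooth
function may be approximated, for values $u = [u_1, u_2]$ in a neighborhood of a point
$x = [x_1, x_2]$, by a second-order Taylor series approximation:

$$\begin{aligned}
f(u) \approx{} & f(x) + (u_1 - x_1) \frac{\partial f}{\partial x_1} + (u_2 - x_2) \frac{\partial f}{\partial x_2} \\
& + (u_1 - x_1)^2 \frac{1}{2} \frac{\partial^2 f}{\partial x_1^2} + (u_1 - x_1)(u_2 - x_2) \frac{\partial^2 f}{\partial x_1 \partial x_2} + (u_2 - x_2)^2 \frac{1}{2} \frac{\partial^2 f}{\partial x_2^2}.
\end{aligned}$$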


We see that the model includes an interaction term $(u_1 - x_1)(u_2 - x_2)$, and the
approximation suggests that for a prediction at the point $x$, we can use the local
polynomial:

$$\begin{aligned}
P_x(u; \beta_x) ={} & \beta_{0x} + (u_1 - x_1) \beta_{1x} + (u_2 - x_2) \beta_{2x} \\
& + (u_1 - x_1)^2 \beta_{3x} + (u_1 - x_1)(u_2 - x_2) \beta_{4x} + (u_2 - x_2)^2 \beta_{5x}.
\end{aligned}$$

Estimation proceeds exactly as in the one-dimensional case by choosing $\beta_x$ to
minimize the locally weighted sum of squares

$$\sum_{i=1}^n w_i(x) \left[ Y_i - P_x(x_i; \beta_x) \right]^2,$$

with the weights $w_i(x)$ depending on a two-dimensional kernel function. The
simplest choice is the product of one-dimensional kernels, that is,

$$w_i(x) = K_1\left( \frac{x_1 - x_{i1}}{\lambda_1} \right) \times K_2\left( \frac{x_2 - x_{i2}}{\lambda_2} \right).$$

The fitted value is

$$\hat f(x) = P_x(x; \hat\beta_x) = \hat\beta_{0x}.$$

Embedding  multivariate  local  polynomials  within  a  generalized  linear  model 
framework  is  straightforward,  by  simple  extension  of  the  approach  described  in 
Sect.  11.5.3. 

As with multivariate spline methods, the local polynomial approach becomes
more  difficult  as  the  dimensionality  increases,  due  to  the  sparsity  of  points  in  high 
dimensions. 
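
The bivariate local quadratic fit described in this section amounts to a weighted least squares problem at each target point. The sketch below assumes Gaussian kernels for both $K_1$ and $K_2$ and solves the problem with a generic least squares routine; the function names `gauss` and `local_quadratic_2d` are hypothetical.

```python
import numpy as np

def gauss(u):
    """Gaussian kernel (unnormalized; the constant cancels in the fit)."""
    return np.exp(-0.5 * u**2)

def local_quadratic_2d(x0, X, y, lam1, lam2):
    """Local quadratic fit at x0 = (x01, x02) with a product kernel; returns the intercept."""
    d1 = X[:, 0] - x0[0]
    d2 = X[:, 1] - x0[1]
    w = gauss(d1 / lam1) * gauss(d2 / lam2)       # product of one-dimensional kernels
    # Local polynomial terms: 1, d1, d2, d1^2, d1*d2, d2^2.
    D = np.column_stack([np.ones_like(d1), d1, d2, d1**2, d1 * d2, d2**2])
    sw = np.sqrt(w)
    # Weighted least squares via least squares on sqrt-weight-scaled rows.
    beta, *_ = np.linalg.lstsq(D * sw[:, None], y * sw, rcond=None)
    return beta[0]
```

Since the local model contains every quadratic surface, noise-free data generated from a quadratic function are reproduced exactly at any target point, which provides a convenient sanity check. With six local parameters per point, the number of observations carrying appreciable weight must comfortably exceed six, which is one concrete face of the sparsity problem noted above.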


12.5  Smoothing  Parameter  Estimation 
12.5.1  Conventional  Approaches 

The simplest way to control the level of smoothing is to specify an effective degrees
of freedom, $\text{df}_j$, for each of the $j = 1, \ldots, k$ smoothers (where we have assumed
for simplicity that we are modeling $k$ univariate smoothers).

As  we  saw  in  Chap.  10,  there  are  two  ways  of  estimating  smoothing  parameters. 
The  first  is  to  attempt  to  minimize  prediction  error  which  may  be  represented 
by AIC-like criteria or via cross-validation. Such procedures were described in
Sect.  10.6.  The  second  method  is  to  embed  the  penalized  smoothing  within  a  mixed 
model  framework  and  then  use  likelihood  (ML  or  REML)  or  Bayesian  estimation 
of  the  random  effects  variances.  This  approach  is  described  in  Sect.  12.5.2. 

For GAMs, multiple smoothing parameters may be estimated during the
iterative cycle (e.g., within the P-IRLS iterates), which is known as performance
iteration.  As  an  alternative,  fitting  may  be  carried  out  multiple  times  for  each  set 
of  smoothing  parameters,  which  is  known  as  outer  iteration.  The  latter  is  more 
reliable  but  requires  more  work  to  implement.  However,  the  methods  for  minimizing 
prediction  error  using  outer  iteration  described  in  Wood  (2008)  are  shown  to  be 
almost  as  computationally  efficient  as  performance  iteration. 


12.5.2  Mixed  Model  Formulation 

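To illustrate the general technique, consider a linear additive model with penalized
regression splines providing the smoothing for each of the $k$ covariates. Further,
assume a truncated polynomial representation with a degree $p_j$ polynomial and
$L_j$ knots with locations $\xi_{jl}$, $l = 1, \ldots, L_j$, associated with the $j$th smooth,
$j = 1, \ldots, k$. A mixed model representation is

$$Y_i = \beta_0 + \sum_{j=1}^k \boldsymbol{x}_{ij} \beta_j + \sum_{j=1}^k \boldsymbol{z}_{ij} b_j + \epsilon_i \tag{12.13}$$

with $\epsilon_i \mid \sigma_\epsilon^2 \sim N(0, \sigma_\epsilon^2)$,

$$\boldsymbol{x}_{ij} = [x_{ij}, \ldots, x_{ij}^{p_j}], \quad \beta_j = [\beta_{j1}, \ldots, \beta_{jp_j}]^T$$

and

$$\boldsymbol{z}_{ij} = [(x_{ij} - \xi_{j1})_+^{p_j}, \ldots, (x_{ij} - \xi_{jL_j})_+^{p_j}], \quad b_j = [b_{j1}, \ldots, b_{jL_j}]^T.$$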

The parameters $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_k$ are treated as fixed effects with $\boldsymbol{b}_1, \ldots, \boldsymbol{b}_k$ being a set of independent random effects. The penalization is incorporated through the introduction of $k$ sets of random effects:

$$b_{jl} \mid \sigma_j^2 \sim_{\text{ind}} N(0, \sigma_j^2), \qquad l = 1, \ldots, L_j,$$

for $j = 1, \ldots, k$. Inference, from either a likelihood (Sect. 11.2.8) or Bayesian (Sect. 11.2.9) perspective, proceeds exactly as in the univariate covariate case. The extension of (12.13) to a tensor product spline model, such as (12.12), is straightforward. Comparison with (12.12) reveals the strong simplification of (12.13) (with $k = 2$ and $p_j = 1$), which includes no cross-product terms.
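To make the structure of the mixed model representation concrete, the following minimal sketch builds, for a single observation, the polynomial terms that multiply the fixed effects and the truncated power terms that multiply the random effects, and then sums the contributions over the smooths. The function names, argument layout, and the use of an intercept called `beta0` are assumptions made for illustration; they are not taken from any particular software.

```python
def truncated_power_basis(x, knots, degree):
    """Design terms for one covariate value x.

    fixed  : [x, x^2, ..., x^degree], multiplying the fixed effects beta_j
    random : [(x - xi_1)_+^degree, ..., (x - xi_L)_+^degree],
             multiplying the random effects b_j
    """
    fixed = [x ** d for d in range(1, degree + 1)]
    random_part = [max(x - xi, 0.0) ** degree for xi in knots]
    return fixed, random_part


def additive_predictor(xs, beta0, betas, bs, knots, degrees):
    """Linear predictor: intercept plus, for each smooth j, x_ij' beta_j + z_ij' b_j."""
    total = beta0
    for x, beta, b, kn, p in zip(xs, betas, bs, knots, degrees):
        fixed, random_part = truncated_power_basis(x, kn, p)
        total += sum(c * v for c, v in zip(beta, fixed))
        total += sum(c * v for c, v in zip(b, random_part))
    return total


if __name__ == "__main__":
    # Two linear (degree 1) smooths with hypothetical coefficients and knots.
    value = additive_predictor(
        xs=[2.0, 0.5],
        beta0=1.0,
        betas=[[0.5], [2.0]],
        bs=[[1.0, 1.0], [3.0]],
        knots=[[1.0, 3.0], [0.0]],
        degrees=[1, 1],
    )
    print(value)
```

In a fitted model the coefficients in `bs` would be shrunk toward zero through their normal distributions, which is how the penalization enters; the sketch only evaluates the predictor for given coefficient values.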

One  can  estimate  the  variance  components  (and  hence  the  amount  of  smoothing) 
using  a  fully  Bayesian  approach  or  via  ML/REML.  With  a  likelihood-based 
approach,  one  requires  the  random  effects  to  be  integrated  from  the  model.  For 
non-Gaussian  response  models,  these  integrals  cannot  be  evaluated  analytically. 
Approaches  to  integration  were  reviewed  in  Sect.  3.7.  One  iterative  strategy  we 
mention  briefly  here  linearizes  the  model,  which  allows  linear  methods  of  estimation 
to  be  applied.  This  strategy  is  known  as  penalized  quasi-likelihood  (PQL,  Breslow 
and  Clayton  1993)  and  is  essentially  equivalent  to  performance  iteration.  Using  the 
more  sophisticated  Laplace  approximation  gives  one  approach  to  outer  iteration. 
See  Wood  (2011)  for  details  of  a  method  that  is  almost  as  computationally  efficient 
as  performance  iteration.  Bayesian  approaches  typically  use  MCMC  (Sect.  3.8)  or 
INLA  (Sect.  3.7.4). 

Some theoretical work (Wahba 1985; Kauermann 2005) suggests that methods that minimize prediction error criteria give better prediction error asymptotically, but have slower convergence of smoothing parameters (Härdle et al. 1988). Reiss and Ogden (2009) show that the equations by which generalized cross-validation (GCV) and REML estimates are obtained have a similar form and use this to examine the properties of the estimates. They find that converging to a local, rather than a global, solution appears to happen more frequently for GCV than for REML. Hence, care is required in finding a solution, and Reiss and Ogden (2009) recommend plotting the criterion function over a wide range of values. Wood (2011) discusses how GCV can lead to “occasional severe under-smoothing,” and this is backed up by Reiss and Ogden (2009) who argue, based on their theoretical derivations, that REML estimates will tend to be more stable than GCV estimates.


Example:  Prostate  Cancer 

We  return  to  the  prostate  cancer  example  and  fit  a  GAM  with  a  tensor  product 
spline  smoother  for  log  cancer  volume  and  log  weight,  along  with  (univariate)  cubic 
regression  splines  for  age,  log  BPH,  log  capsular  penetration  and  PGS45,  and  with 
linear  terms  for  SVI  and  the  Gleason  score.  Each  of  the  constituent  smoothers  in  the 
tensor  product  is  taken  to  be  a  cubic  regression  spline  with  bases  of  size  6  for  each  of 
the  components.  GCV  is  used  for  estimation  of  the  smoothing  parameters  and  results 
in  an  effective  degrees  of  freedom  of  12.4  for  the  tensor  product  term.  Figure  12.2c 
provides  a  perspective  plot  of  the  fitted  bivariate  surface.  It  is  reassuring  that  the  fit 
is  very  similar  to  the  thin  plate  regression  spline  in  panel  (b). 


12.6  Varying-Coefficient  Models

Varying-coefficient  models  (Cleveland  et  al.  1991;  Hastie  and  Tibshirani  1993) 
provide  another  flexible  model  based  on  a  linear  form  but  with  model  coefficients 
that  vary  smoothly  as  a  function  of  other  variables.  We  begin  our  discussion  by 
giving  an  example  with  two  covariates,  x  and  z.  The  model  is 

$$E[Y \mid x, z] = \mu = \beta_0(z) + \beta_1(z) x \qquad (12.14)$$

so that we have a linear regression with both the intercept and the slope corresponding to $x$ being smooth functions of $z$. The first thing to note is that the model is not symmetric in the two covariates. Rather, the linear association between $Y$ and $x$ is modified by $z$, and we have a specific form of interaction model.

The extension to a generalized linear/additive model setting is clear, on replacement of $E[Y \mid x, z]$ by $g(\mu)$. With covariates $\boldsymbol{x} = [x_1, \ldots, x_k]$ and $z$ the model is

$$g(\mu) = \beta_0(z) + \sum_{j=1}^{k} \beta_j(z) x_j,$$

so that each of the slopes is modified by $z$. Computation and inference are straightforward for the varying-coefficient model.




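We return to the case of just two variables, $x$ and $z$, and consider penalized linear spline smoothers with $L$ knots having locations $\xi_l$, $l = 1, \ldots, L$, for each of the intercept and slope. Then, model (12.14) becomes:

$$E[Y \mid x, z] = \underbrace{\alpha_0^{(0)} + \alpha_1^{(0)} z + \sum_{l=1}^{L} b_l^{(0)} (z - \xi_l)_+}_{\beta_0(z)} + \underbrace{\left( \alpha_0^{(1)} + \alpha_1^{(1)} z + \sum_{l=1}^{L} b_l^{(1)} (z - \xi_l)_+ \right)}_{\beta_1(z)} x.$$

A mixed model representation (Ruppert et al. 2003, Sect. 12.4) assumes independent random effects with $b_l^{(0)} \mid \sigma_0^2 \sim_{\text{iid}} N(0, \sigma_0^2)$ and $b_l^{(1)} \mid \sigma_1^2 \sim_{\text{iid}} N(0, \sigma_1^2)$ for $l = 1, \ldots, L$.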
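The linear spline form of the varying-coefficient model can be evaluated directly once coefficients are given. The sketch below is only an illustration under assumed names: `linear_spline` evaluates one of the coefficient functions from its two polynomial coefficients and its truncated-line coefficients, and `varying_coefficient_mean` combines the intercept and slope functions as in (12.14). The numerical values in the demonstration are hypothetical.

```python
def linear_spline(z, a0, a1, b, knots):
    """a0 + a1*z + sum_l b_l * (z - xi_l)_+ for a linear spline in z."""
    return a0 + a1 * z + sum(bl * max(z - xi, 0.0) for bl, xi in zip(b, knots))


def varying_coefficient_mean(x, z, intercept_params, slope_params, knots):
    """E[Y | x, z] = beta0(z) + beta1(z) * x, with both functions linear splines.

    intercept_params and slope_params are (a0, a1, b) triples.
    """
    a0, a1, b = intercept_params
    beta0 = linear_spline(z, a0, a1, b, knots)
    c0, c1, d = slope_params
    beta1 = linear_spline(z, c0, c1, d, knots)
    return beta0 + beta1 * x


if __name__ == "__main__":
    knots = [1.0]
    intercept = (0.0, 1.0, [-2.0])   # beta0(z) = z - 2 (z - 1)_+
    slope = (1.0, 0.0, [1.0])        # beta1(z) = 1 + (z - 1)_+
    for z in (0.5, 2.0):
        print(z, varying_coefficient_mean(3.0, z, intercept, slope, knots))
```

The asymmetry noted in the text is visible here: `x` enters only linearly, while `z` enters through the two spline functions.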

An  obvious  application  of  varying-coefficient  models  is  in  the  situation  in  which 
the  modifying  variables  correspond  to  time.  As  a  simple  example,  if  a  response  and 
covariate  x  are  collected  over  time,  we  might  consider  the  model 

$$Y_t = \alpha + \beta(t) x_t + \epsilon_t, \qquad (12.15)$$

where  we  have  chosen  a  simple  model  in  which  the  slope,  and  not  the  intercept, 
is  a  function  of  time.  We  briefly  digress  to  provide  a  link  with  Bayesian  dynamic 
linear  models,  which  were  developed  for  the  analysis  of  time  series  data  and  allow 
regression  coefficients  to  vary  according  to  an  autoregressive  model.  The  simplest 
dynamic  linear  model  (see,  for  example,  West  and  Harrison  1997)  is  defined  by  the 
equations 


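$$Y_t = \alpha + x_t \beta_t + \epsilon_t, \qquad \epsilon_t \mid \sigma_\epsilon^2 \sim_{\text{iid}} N(0, \sigma_\epsilon^2),$$
$$\beta_t = \beta_{t-1} + \delta_t, \qquad \delta_t \mid \sigma_\delta^2 \sim_{\text{iid}} N(0, \sigma_\delta^2).$$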

Accordingly,  we  have  a  varying-coefficient  model  of  the  form  of  (12.15)  with 
smoothing  carried  out  via  a  particular  flexible  form:  a  first-order  Markov  model 
(the  limiting  form  of  an  autoregressive  model,  see  Sect.  8.4.2).  This  is  also  a  mixed 
model  but  with  the  first  differences  $(\beta_t - \beta_{t-1})$  being  modeled.  A  spatial  form  of
this  autoregressive  model  was  considered  in  Sect.  9.7. 
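The two equations of the simplest dynamic linear model can be simulated in a few lines, which may help in seeing how the random walk on the slope produces a smoothly drifting coefficient. This is a sketch only: the function name, the starting value `beta0` for the slope, and the parameter values in the demonstration are assumptions for illustration.

```python
import random


def simulate_dlm(x, alpha, beta0, sd_eps, sd_delta, seed=0):
    """Simulate Y_t = alpha + x_t * beta_t + eps_t with beta_t = beta_{t-1} + delta_t.

    beta0 is the (assumed known) slope before the first time point.
    Returns the simulated responses and the slope path.
    """
    rng = random.Random(seed)
    beta = beta0
    betas, ys = [], []
    for xt in x:
        beta = beta + rng.gauss(0.0, sd_delta)                 # state equation
        betas.append(beta)
        ys.append(alpha + xt * beta + rng.gauss(0.0, sd_eps))  # observation equation
    return ys, betas


if __name__ == "__main__":
    x = [0.1 * t for t in range(1, 21)]
    ys, betas = simulate_dlm(x, alpha=1.0, beta0=2.0, sd_eps=0.5, sd_delta=0.1)
    print(len(ys), len(betas))
```

Setting `sd_delta` to zero removes the variation in the slope and returns an ordinary linear regression, while larger values allow the slope to move more freely between adjacent time points.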


Example:  Ethanol  Data 

We illustrate the use of a varying-coefficient model with the ethanol data described in Sect. 10.2.2. Figure 10.2 provides a three-dimensional plot of these data. An initial analysis, with NOx modeled as a linear function of C and a quadratic function of E, was found to provide a poor fit. Specifically, the association between NOx and E is far more complex than quadratic. To examine the association more closely and to motivate the varying-coefficient model, we split the E variable into nine bins, based on the quantiles of E, with an approximately equal number of pairs of [NOx, C] points within each bin. We then fit a linear model to each portion of the data. Figure 12.3 shows the resultant data and fitted lines. A linear model appears, at least visually, to provide a reasonable fit in each panel, though the intercepts and slopes vary across the quantiles of E.

Fig. 12.3 NOx versus C for nine subsets of the ethanol data (defined via the quantiles of E), with linear model fits superimposed

Figures  12.4(a)  and  (b)  plot  these  intercepts  and  slopes  as  a  function  of  the 
midpoint  of  the  bins  for  E,  and  we  see  that  the  coefficients  vary  in  a  non-monotonic 
fashion.  Consequently,  we  fit  the  varying-coefficient  model 

$$E[\text{NOx} \mid \text{C}, \text{E}] = \beta_0(\text{E}) + \beta_1(\text{E}) \times \text{C}, \qquad (12.16)$$

with $\beta_0(\text{E})$ and $\beta_1(\text{E})$ both modeled as penalized cubic regression splines with 10 knots each. The smoothing parameters for the smoothers were chosen using GCV, which resulted in 6.4 and 4.7 effective degrees of freedom for the intercept and slope, respectively. The fitted smooths are shown in Fig. 12.4, and we see that the intercept and slope rise then fall as a function of E.

Fig. 12.4 (a) Intercepts and (b) slopes from linear models fitted to the ethanol data in which the response is NOx and the covariate is C, with the nine groups defined by quantiles of the E variable. The fitted curves are from a varying-coefficient model in which the intercepts and slopes are modeled as cubic regression splines in E

Fig. 12.5 Image plot of the predictive surface from the varying-coefficient model (12.16) fitted to the ethanol data. Light and dark gray values indicate, respectively, high and low values of expected NOx

Figure  12.5  gives  the  fitted  surface.  The  inverted  U-shape  in  E  is  evident.  More 
subtly,  as  we  saw  in  Fig.  12.4(b),  the  strength  (and  sign)  of  the  linear  association 
between NOx and C varies as a function of E.


12.7  Regression  Trees

12.7.1  Hierarchical  Partitioning

In  this  section  we  consider  a  quite  different  approach  to  modeling,  in  which  the 
covariate  space  is  partitioned  into  regions  within  which  the  response  is  relatively 
homogeneous.  A  key  feature  is  that  although  a  model  for  the  data  is  produced,  the 
approach  is  best  described  algorithmically.  As  we  will  see,  a  tree-based  approach 
to  the  construction  of  partitions  is  both  interpretable  and  amenable  to  computation. 
Our  development  follows  similar  lines  to  Hastie  et  al.  (2009,  Sect.  9.2). 

In order to motivate tree-based models, we first take a step back and consider ways of constructing partitions; the aim is to produce regions of the covariate space within which the response is constant. In practice, the shapes and sizes of the partition regions will clearly depend on the distribution of the covariates. There are many possible ways (models) by which partitions might be defined, beginning with a completely unrestricted search in which there are no constraints on the shapes and sizes of the partition regions. This is too complex a task to accomplish in practice, however.² We examine a series of partitions for the case of two covariates, $x_1$ and $x_2$, leading to a particular mechanism for partitioning. Figure 12.6(a) shows partitions defined by straight lines in the covariate space, with the lines not constrained to be parallel to either axis (clearly we could start with partitions of even greater complexity). Explaining how the response varies as a function of $x_1$ and $x_2$ for the particular partition in Fig. 12.6(a) is not easy, however. In addition, searching for the best partitions defined with respect to lines of this form is very difficult, particularly when the covariate space is high dimensional. Figure 12.6(b) displays a partition in which the space is dissected with lines that are parallel to the axes, and, though simpler to describe than the previous case, the regions are still not straightforward to explain or compute.


Fig. 12.6 Examples of flexible partitions of the $[x_1, x_2]$ space that use straight lines to define the partitions

² Methods aimed in this direction do exist, for example, in the spatial literature. Knorr-Held and Rasser (2000) and Denison and Holmes (2001) describe Bayesian partition models based on Voronoi tessellations. These models are computationally expensive to implement and have so far been restricted to two-dimensional covariate settings.

Fig. 12.7 Hierarchical binary tree partition of the $[x_1, x_2]$ space

Fig. 12.8 Hypothetical regression tree corresponding to Fig. 12.7. The four splits lead to five terminal nodes (leaves), labeled $R_1, \ldots, R_5$

Figure  12.7  shows  a  tree-based  partition  that  is  based  on  successive  binary
partitions  of  the  predictor  space,  again  to  produce  subsets  of  the  response  which 
are  relatively  constant.  Splits  are  only  allowed  within,  and  not  between,  partitions. 
Such  a  method  has  the  advantage  of  producing  models  that  are  relatively  easy  to 
explain,  since  they  follow  simple  rules,  and  may  be  computed  without  too  much 
difficulty.  The  partition  in  Fig.  12.7  is  generated  by  the  algorithm  illustrated  in  the 
form  of  a  “tree”  in  Fig.  12.8  (notice  that  trees  are  usually  shown  as  growing  down 
the  page).  We  describe  in  detail  how  this  partition  is  constructed. 

The terminology we use is graphical. Decisions are taken at nodes, and the root of the tree is the top node. The terminal nodes are the leaves, and covariate points $x$ assigned to these nodes are assigned a constant fitted value (which is called a classification if the response is discrete). Attached to each nonterminal node is a question that determines a split of the data. Suppose a tree $T_0$ is grown. A subtree of $T_0$ is a tree whose root is a node of $T_0$; it is a rooted subtree if its root is the root of $T_0$. The size of a tree, denoted $|T|$, is the number of leaves.

In Fig. 12.8, the first split is according to $x_2 < t_1$. If this condition is true, then we follow the left branch and next split on $x_1 < t_2$, to give leaves with labels $R_1$ and $R_2$. If we follow the right-hand branch and $x_1 \geq t_3$, we terminate at the $R_3$ leaf. If $x_1 < t_3$, we split again via $x_2 < t_4$ to give the leaves $R_4$ and $R_5$. The model resulting from these operations is

$$f(x_1, x_2) = \sum_{j=1}^{5} \beta_j I([x_1, x_2] \in R_j),$$

where the indicator $I([x_1, x_2] \in R_j)$ is 1 if the point $[x_1, x_2]$ lies in region $R_j$ and is equal to 0 otherwise. Figure 12.9 is a hypothetical surface corresponding to the tree shown in Fig. 12.8.

Fig. 12.9 Hypothetical surface corresponding to the partition of Fig. 12.7 and the tree of Fig. 12.8

The five basis functions $h_j$, which correspond to $R_j$, $j = 1, \ldots, 5$, are:

$$\begin{aligned}
h_1(x_1, x_2) &= I(x_2 < t_1) \times I(x_1 < t_2) \\
h_2(x_1, x_2) &= I(x_2 < t_1) \times I(x_1 \geq t_2) \\
h_3(x_1, x_2) &= I(x_2 \geq t_1) \times I(x_1 \geq t_3) \\
h_4(x_1, x_2) &= I(x_2 \geq t_1) \times I(x_1 < t_3) \times I(x_2 < t_4) \\
h_5(x_1, x_2) &= I(x_2 \geq t_1) \times I(x_1 < t_3) \times I(x_2 \geq t_4).
\end{aligned}$$

These bases cover the covariate space and at any point $x$ only one basis function is nonzero, so that we have a partition.
We  emphasize  that  these  bases  are  not  specified  a  priori,  but  selected  on  the  basis 
of  the  observed  data,  so  that  they  are  locally  adaptive.  A  regression  tree  provides 
a  hierarchical  method  of  describing  the  partitions  (i.e.,  the  partitioning  is  defined 
through  a  nested  series  of  instructions),  which  aids  greatly  in  describing  the  model. 
Tree  models  effectively  perform  variable  selection,  and  discovering  interactions  is 
implicit  in  the  process. 
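The claim that the five basis functions form a partition can be checked numerically. The sketch below codes the indicators for the tree of Fig. 12.8, assuming that the complement of each strict inequality includes the boundary (that is, "not less than" is read as "greater than or equal to") and that the fourth split point is labeled `t4`; the split values in the demonstration are hypothetical.

```python
def tree_basis(x1, x2, t1, t2, t3, t4):
    """Return [h1, ..., h5] for the hypothetical five-leaf tree; exactly one entry is 1."""
    return [
        int(x2 < t1 and x1 < t2),                 # R1
        int(x2 < t1 and x1 >= t2),                # R2
        int(x2 >= t1 and x1 >= t3),               # R3
        int(x2 >= t1 and x1 < t3 and x2 < t4),    # R4
        int(x2 >= t1 and x1 < t3 and x2 >= t4),   # R5
    ]


def tree_prediction(x1, x2, splits, betas):
    """f(x1, x2) = sum_j beta_j * h_j(x1, x2): the fitted value is the active leaf's constant."""
    h = tree_basis(x1, x2, *splits)
    return sum(b * hj for b, hj in zip(betas, h))


if __name__ == "__main__":
    splits = (1.0, 1.0, 2.0, 2.0)           # hypothetical t1, t2, t3, t4
    betas = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical leaf constants
    grid = [0.25 * i for i in range(13)]
    # Every grid point should fall in exactly one region.
    print(all(sum(tree_basis(a, b, *splits)) == 1 for a in grid for b in grid))
    print(tree_prediction(1.0, 1.5, splits, betas))
```

Because exactly one indicator is active at any point, the weighted sum simply selects the constant attached to the leaf into which the point falls.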
There  are  many  ways  in  which  we  could  go  about  “growing”  a  tree.  Clearly  we
could  continue  to  split  the  data  until  each  leaf  contains  a  single  unique  set  of  x 
values,  but  this  would  lead  to  overfitting.  Many  approaches  grow  a  large  tree  and 
then  prune  it  back,  to  avoid  such  overfitting.  There  are  different  ways  to  both  split 
nodes  (e.g.,  only  binary  splits  may  be  performed)  and  prune  back  the  tree. 

We now describe an approach to tree-building based on binary splits. Consider a simple situation in which we have a response $Y$ and $k$ continuous predictors $x_l$, $l = 1, \ldots, k$. A common implementation considers recursive binary partitions in which the $x$ space is first split into two regions on the basis of one of $x_1, \ldots, x_k$, with the variable and split point being chosen to achieve the best fit (according to, say, the residual sum of squares, or more generally the deviance). There are a maximum of $k(n-1)$ partitions to consider. Next, one or both of the regions are split into two more regions. Only partitions within, and not between, current partitions are considered at each step of the algorithm. This process is continued until some stopping rule is satisfied. The final tree may then be pruned back.

For ordered categorical variables, the above procedure poses no ambiguity, but for unordered categorical variables with more than two levels, we may divide the levels into two groups; with $L$ levels there are $2^{L-1} - 1$ pairs of groups. Note that, in general, monotonic transformations of quantitative covariates produce identical results.
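The count of $2^{L-1} - 1$ candidate splits for an unordered categorical variable can be verified by enumeration. In the sketch below (the function name is an assumption, and the levels are assumed to be distinct), the first level is always placed in the left group so that each unordered pair of groups is listed once, and the split that leaves the right group empty is excluded.

```python
from itertools import combinations


def binary_groupings(levels):
    """List all ways of dividing distinct levels into two non-empty groups."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(v for v in rest if v not in extra)
            if right:  # skip the grouping with an empty right-hand group
                splits.append((left, right))
    return splits


if __name__ == "__main__":
    for left, right in binary_groupings(["a", "b", "c", "d"]):
        print(left, right)
```

The exponential growth in the number of groupings is one reason why variables with many levels require care in the split search.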

When the algorithm terminates, we end up with a regression model having fitted values $\beta_j$ in region $R_j$, that is,

$$f(x) = \sum_{j=1}^{J} \beta_j I(x \in R_j). \qquad (12.17)$$

An obvious estimator is

$$\hat{\beta}_j = \frac{1}{n_j} \sum_{i:\, x_i \in R_j} y_i,$$

where $n_j$ is the number of observations in partition $R_j$, $j = 1, \ldots, J$ (so that there are $J$ leaves). Inherent in the construction of this unweighted estimator is an assumption that the error terms are uncorrelated with constant variance (which is consistent with choosing the splits on the basis of minimizing the residual sum of squares).

We now give more detail on how regression trees are “grown.” The algorithm automatically decides on both the variable on which to split and on the split points. To find the best tree, we start with all the data and proceed with a greedy algorithm.³ Consider a particular variable $x_l$ and a split point $s$ and define

$$R_1(l, s) = \{ x : x_l < s \}, \qquad R_2(l, s) = \{ x : x_l \geq s \}.$$

We seek the splitting variable index $l$ and split point $s$ that solve

$$\min_{l,\, s} \left[ \min_{\beta_1} \sum_{i:\, x_i \in R_1(l,s)} (y_i - \beta_1)^2 + \min_{\beta_2} \sum_{i:\, x_i \in R_2(l,s)} (y_i - \beta_2)^2 \right],$$

that is, that minimize the residual sum of squares among models with two response levels, based on a split of one of the $k$ variables. For any choice of $l$ and $s$, the inner minimization is solved by

$$\hat{\beta}_1 = \frac{1}{n_1} \sum_{i:\, x_i \in R_1(l,s)} y_i, \qquad \hat{\beta}_2 = \frac{1}{n_2} \sum_{i:\, x_i \in R_2(l,s)} y_i,$$

where $n_1$ and $n_2$ are the numbers of observations in $R_1(l, s)$ and $R_2(l, s)$.

³ A greedy algorithm is one in which “locally” optimal choices are made at each stage.

Each of the covariates, $x_1, \ldots, x_k$, is scanned and, for each, the best split point $s$ is found, which is fast. Having found the best split, we partition the data into the two resulting regions, and the splitting process is then repeated on each region to find the next split.
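A single step of the greedy search can be written compactly. The sketch below scans each covariate and each candidate split point and returns the combination with the smallest two-region residual sum of squares. Taking candidate split points at the midpoints between adjacent distinct observed values is an assumption made for illustration, as are the function names; the text does not specify where within a gap the split point is placed.

```python
def sum_sq(ys):
    """Residual sum of squares about the mean (zero for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(X, y):
    """Return (rss, l, s) minimizing the two-region residual sum of squares.

    X is a list of rows (one list of k covariate values per observation),
    y is the list of responses, l is a zero-based variable index.
    Returns None if no covariate has two distinct values.
    """
    best = None
    k = len(X[0])
    for l in range(k):
        values = sorted(set(row[l] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[l] < s]
            right = [yi for row, yi in zip(X, y) if row[l] >= s]
            rss = sum_sq(left) + sum_sq(right)  # inner minimization uses region means
            if best is None or rss < best[0]:
                best = (rss, l, s)
    return best


if __name__ == "__main__":
    X = [[1, 5], [2, 1], [3, 4], [4, 2]]
    y = [0, 0, 10, 10]
    print(best_split(X, y))
```

Growing a tree amounts to applying `best_split` recursively to the observations falling in each of the two resulting regions until a stopping rule is met.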

We  now  return  to  the  key  question:  How  large  a  tree  should  be  grown?  If  the  tree 
is  too  large,  then  we  will  overfit,  and  if  too  small,  the  tree  will  not  capture  important 
features.  The  tree  size  is  therefore  acting  as  a  tuning  parameter  that  determines 
complexity.  By  analogy  with  forward  selection,  growing  a  tree  until  (say)  the  sum 
of  squares  is  not  significantly  reduced  in  size  is  shortsighted,  since  splits  below  the 
current  tree  may  be  highly  beneficial.  In  practice,  a  common  approach  is  to  first  grow 
a  large  tree,  $T_0$,  stopping  when  some  minimum  node  size  is  reached  (in  the  extreme
case  we  could  continue  until  each  leaf  contains  a  single  observation);  the  tree  is 
then  pruned  back.  The  space  of  trees  becomes  large  very  quickly,  as  k  increases. 
Consequently,  searching  over  all  subtrees  and  using,  for  example,  cross-validation 
or  AIC  to  select  the  “best”  is  not  feasible.  We  discuss  an  alternative  way  to  penalize 
overfitting. 

Let $T$ be a subtree of $T_0$ that is obtained by weakest-link pruning $T_0$, that is, by collapsing any number of its internal (nonterminal) nodes. We let

$$S_j(T) = \frac{1}{n_j} \sum_{i:\, x_i \in R_j} (y_i - \hat{\beta}_j)^2 \qquad (12.18)$$

denote the within-partition (average) residual sum of squares and $|T|$ be the number of terminal nodes in $T$. With respect to (12.17), $J = |T|$. Following Breiman et al. (1984), define the cost complexity criterion as the total residual sum of squares plus a penalty term that consists of a smoothing parameter $\lambda$ multiplied by the size of the tree:

$$C_\lambda(T) = \sum_{j=1}^{|T|} n_j S_j(T) + \lambda |T|. \qquad (12.19)$$

Hence, we have a penalized sum of squares. For a given $\lambda$, we can find the subtree $T_\lambda \subseteq T_0$ that minimizes $C_\lambda(T)$. The tuning parameter $\lambda \geq 0$ obviously balances the tree size and the goodness of fit of the tree to the data, with larger values giving smaller trees. As usual we are encountering the bias-variance trade-off. Large trees exhibit low bias and high variance, with complementary behavior being exhibited by small trees. With $\lambda = 0$, we obtain the full tree, $T_0$.

For each $\lambda$, it can be shown that there exists a unique smallest subtree, $T_\lambda$, that minimizes $C_\lambda(T)$. See Breiman et al. (1984) and Ripley (1996) for details. This tree can be found using weakest-link pruning. The estimation of the smoothing parameter $\lambda$ may be carried out via cross-validation to give a final tree $T_{\hat{\lambda}}$. Specifically, cross-validation splits are first formed, and then, for a given $\lambda$, the tree that minimizes (12.19) can be found. For this tree, the cross-validation sum of squares $\sum (y - \hat{y})^2$ can be calculated over the left-out data $y$, where $\hat{y}$ is the prediction from the tree. This procedure is carried out for different values of $\lambda$, and one may pick the value that minimizes the sum of squares.
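The trade-off expressed by the cost complexity criterion (12.19) can be seen with a small numerical sketch. Each candidate subtree is summarized by a list of $(n_j, S_j)$ pairs, one per leaf, with $S_j$ taken to be the average squared residual in leaf $j$ so that $n_j S_j$ is that leaf's residual sum of squares. The candidate subtrees and their numbers are hypothetical, and the function names are assumptions; a real implementation would generate the candidates by weakest-link pruning rather than listing them by hand.

```python
def cost_complexity(leaves, lam):
    """C_lambda(T) = sum_j n_j * S_j + lambda * |T| for leaves given as (n_j, S_j) pairs."""
    return sum(n * s for n, s in leaves) + lam * len(leaves)


def select_subtree(candidates, lam):
    """Name of the candidate minimizing C_lambda; ties are broken in favor of fewer leaves."""
    return min(
        candidates,
        key=lambda name: (cost_complexity(candidates[name], lam), len(candidates[name])),
    )


if __name__ == "__main__":
    # Hypothetical subtrees of one tree grown on eight observations.
    candidates = {
        "root": [(8, 4.0)],
        "two_leaves": [(4, 1.0), (4, 1.5)],
        "four_leaves": [(2, 0.5), (2, 0.5), (2, 0.5), (2, 1.0)],
    }
    for lam in (0.0, 1.0, 3.0, 30.0):
        print(lam, select_subtree(candidates, lam))
```

With no penalty the largest tree has the smallest criterion value, and as the penalty grows the selected subtree shrinks toward the root, which mirrors the behavior described in the text.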

Before  moving  to  an  example,  we  make  some  general  comments  about  regression 
trees.  See  Hastie  et  al.  (2009,  Sect.  9.2.4)  and  Berk  (2008,  Chap.  3)  for  more 
extensive  discussions. 

In  applications  there  are  often  missing  covariate  values.  One  approach  to 
accommodating  such  values  that  is  applicable  to  categorical  variables  is  to  create  a 
“missing”  category;  this  may  reveal  that  responses  with  some  missing  values  behave 
differently  to  those  without  missing  values.  Another  approach  is  to  drop  cases  down 
the  tree,  as  far  as  they  will  go,  until  a  decision  on  a  missing  value  is  reached.  At  that 
point,  the  mean  of  y  can  be  calculated  from  the  other  cases  available  at  this  node  and 
can  be  used  as  the  prediction.  This  can  result  in  decisions  being  made  based  on  little 
information,  however.  A  general  alternative  strategy  is  to  create  surrogate  variables 
that mimic the behavior of the missing variables. When considering a predictor for a split, only the non-missing values are used. Once the best predictor/split combination is selected, other predictor/split points are examined to see which best mimics the one selected. For example, suppose the optimal split based on the non-missing observations is based on $x_1$. The binary outcome defined by the split on $x_1$ is then taken as response, and we try to predict this variable using splits based on each of $x_l$, $l = 2, \ldots, k$. The classification rate is then examined with the best, second best, etc., surrogates being recorded. When training data with missing values on $x_1$ are encountered, one of the surrogates is then used instead, with the variable chosen being the one that is available with the best classification rate. The same strategy is used for new cases with missing values. The basic idea is to exploit correlation between the covariates. The best advice with regard to missing data is obvious: one should avoid having missing values as much as possible when the study is conducted, and if there is a large proportion of missing values, predictions should be viewed with a fair amount of skepticism, whatever the correction method
employed.  A  number  of  authors  have  pointed  out  that  variables  having  more  values 
are favored in the splitting procedure (e.g., Breiman et al. 1984, p. 42). Variables with more missing values are also favored (Kim and Loh 2001).

An undesirable aspect of the fitted surface is that, by construction, it is piecewise constant, which will often not be plausible a priori. Multivariate adaptive regression splines (MARS, to be described in Sect. 12.7.2) construct basis functions from linear segments, in order to alleviate this problem. The flexibility of trees can be a disadvantage since one cannot build in structure which one might think is present. For example, consider an additive model with two variables. Specifically, suppose the true model is $E[Y \mid x_1, x_2] = \beta_1 I(x_1 < t_1) + \beta_2 I(x_2 < t_2)$ and the first split is at $x_1 \approx t_1$. Then, two subsequent splits would be needed, one on each branch at $x_2 \approx t_2$.

Fundamentally,  carrying  out  inference  with  regression  trees  is  difficult,  because 
one  needs  to  consider  the  stepwise  nature  of  the  search  algorithm  (Sect.  4.8.1 
discussed  the  inherent  difficulties  of  such  an  approach).  One  solution,  based  on 
permutation methods, has been suggested by Hothorn et al. (2006). Gordon and Olshen (1978, 1984) and Olshen (2007), among others, have produced results on the conditions under which tree-based approaches are consistent.

A  major  problem  with  trees  is  that  they  can  exhibit  high  variance,  in  the  sense 
that  a  small  change  in  the  data  can  result  in  a  very  different  tree  being  formed. 
The  hierarchical  nature  of  the  algorithm  is  responsible  for  this  behavior,  since  the 
effect  of  changes  is  propagated  down  the  tree.  Later  in  the  chapter  we  will  describe 
bagging  (Sect.  12.8.5)  and  random  forest  (Sect.  12.8.6)  approaches  that  consider 
collections  of  trees  in  order  to  alleviate  this  instability. 

We  now  illustrate  the  use  of  regression  trees  with  the  prostate  cancer  data.  In 
Sect.  12.8.4  we  consider  tree-based  approaches  to  classification. 


Example:  Prostate  Cancer 

We  fit  a  binary  regression  tree  model  treating  log  PSA  as  the  response  and  with 
the  splits  based  on  the  eight  covariates.  We  grow  the  tree  with  a  requirement  that 
there  must  be  at  least  three  observations  in  each  leaf.  This  specification  leads  to  a 
regression  tree  with  27  splits. 

We now choose the smoothing parameter $\lambda$ based on cross-validation and minimizing (12.19), with weakest-link pruning being carried out for each candidate value of $\lambda$. Figure 12.10 plots the cross-validation score (along with an estimate of the standard error) as a function of “complexity” (on the bottom axis) and tree size (top axis). The complexity score here is the improvement in $R^2$ (Sect. 4.8.2) when the extra split is made. The tree that attains the minimum CV is displayed in Fig. 12.11 and has four splits and five leaves (terminal nodes). We saw in Fig. 10.7 that when the lasso was used, log cancer volume was the most important variable (in the sense of being the last to be removed from the model), followed by log weight and SVI. Consequently, it is no surprise that two of the splits are on log cancer volume, with one each for log weight and SVI.

Fig. 12.10 Cross-validation score versus complexity, as measured by tree size (top axis) and improvement in $R^2$ (bottom axis), for the prostate cancer data

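The final model is

$$\hat{f}(x) = \sum_{j=1}^{5} \hat{\beta}_j h_j(x),$$

where the numerical values of $\hat{\beta}_j$ are given in Fig. 12.11 and

$$\begin{aligned}
h_1(x) &= I(\text{lcavol} < -0.4786) \\
h_2(x) &= I(\text{lcavol} \geq -0.4786) \times I(\text{lcavol} < 2.462) \times I(\text{lweight} < 3.689) \times I(\text{svi} < 0.5) \\
h_3(x) &= I(\text{lcavol} \geq -0.4786) \times I(\text{lcavol} < 2.462) \times I(\text{lweight} < 3.689) \times I(\text{svi} \geq 0.5) \\
h_4(x) &= I(\text{lcavol} \geq -0.4786) \times I(\text{lcavol} < 2.462) \times I(\text{lweight} \geq 3.689) \\
h_5(x) &= I(\text{lcavol} \geq 2.462).
\end{aligned}$$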

In  terms  of  assigning  a  prediction  to  a  new  observation  with  covariates  x,  we  simply 
read  down  the  tree  in  Fig.  12.11. 
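Reading down the tree is equivalent to evaluating a short sequence of nested conditions. The sketch below returns the index $j$ of the leaf (equivalently, of the nonzero basis function $h_j$) for a new observation, using the split points quoted above. The function name is an assumption, boundary values are assumed to go to the "greater than or equal to" side, and the leaf means are not reproduced here because they appear only in Fig. 12.11.

```python
def prostate_leaf(lcavol, lweight, svi):
    """Index j in 1..5 of the leaf into which an observation falls."""
    if lcavol < -0.4786:
        return 1
    if lcavol >= 2.462:
        return 5
    # From here on, -0.4786 <= lcavol < 2.462.
    if lweight >= 3.689:
        return 4
    return 2 if svi < 0.5 else 3


if __name__ == "__main__":
    print(prostate_leaf(lcavol=1.0, lweight=3.0, svi=0))
```

A prediction would then be obtained by looking up the estimated mean response for the returned leaf.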




Fig. 12.11 Hierarchical regression tree for the prostate cancer data. For each leaf we give the estimated mean response and the number of observations


12.7.2  Multivariate  Adaptive  Regression  Splines

We briefly describe the multivariate adaptive regression splines (MARS) algorithm that combines stepwise linear regression with a spline/tree model; MARS was introduced in Friedman (1991). MARS overcomes the discreteness of the fitted regression tree model by using piecewise linear basis functions of the form $(x_j - t)_+$ and $(t - x_j)_+$ for $j = 1, \dots, k$; these are known as a reflected pair. Here, $x_j$ refers to a generic covariate, and $t$ to an observed value of that covariate. Hence, we have a pair of truncated line segments, which we have already seen used as building blocks for splines in Sect. 11.2.1. The collection of basis functions is

$$\left\{ (x_l - t)_+,\ (t - x_l)_+ :\ t \in \{x_{1l}, \dots, x_{nl}\},\ l = 1, \dots, k \right\}. \qquad (12.20)$$

If all of the covariate values are distinct, there are $2nk$ basis functions in total.
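The collection (12.20) is easy to enumerate for a small design matrix. The sketch below is illustrative only; the function names `reflected_pair` and `candidate_basis` are assumptions of this example, and the data are simulated.

```python
import numpy as np


def reflected_pair(x, t):
    """Return ((x - t)_+, (t - x)_+), evaluated elementwise."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)


def candidate_basis(X):
    """Stack every reflected pair in (12.20): one pair per covariate and observed knot."""
    n, k = X.shape
    columns = []
    for l in range(k):
        for t in np.unique(X[:, l]):
            pos, neg = reflected_pair(X[:, l], t)
            columns.extend([pos, neg])
    return np.column_stack(columns)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))      # n = 6 observations, k = 2 covariates
B = candidate_basis(X)
print(B.shape)                   # 2nk = 24 columns when all values are distinct
```

Note that the two members of a pair satisfy $(x - t)_+ - (t - x)_+ = x - t$, so a pair with equal and opposite coefficients reproduces a straight line.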

The model is of the form

$$f(x) = \beta_0 + \sum_{j=1}^{J} \beta_j h_j(x)$$

where each $h_j(x)$ is a function from the collection (12.20) or a product of two or more such functions. To select basis functions, forward selection is used (Sect. 4.8.1). At a particular step suppose we have functions $h_l(x)$, $l = 1, \dots, L$ in the current model. We then add the term of the form

$$\beta_{L+1} h_l(x) \times (x_{l'} - t)_+ + \beta_{L+2} h_l(x) \times (t - x_{l'})_+$$

that gives the largest decrease in the residual sum of squares. As with regression trees the process is continued until some preset maximum number of terms is present. This typically gives overfitting, and so backward elimination (Sect. 4.8.1) is used to reduce the size of the model by removing, one by one, the term that gives the smallest increase in the residual sum of squares. Note that whereas in the forward direction the terms are added in pairs, in the backward direction, single terms can be removed. The balance between the size of the model and the closeness of the predictions to the observations is decided upon using generalized cross-validation (recall that generalized cross-validation requires less computation than ordinary cross-validation; Sect. 10.6.3).

Fig. 12.12 Image plots of the fitted surfaces for the ethanol data obtained from (a) a pruned regression tree and (b) the MARS algorithm
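A single forward step of the kind just described can be sketched with ordinary least squares. This is a simplified illustration rather than Friedman's implementation: it refits the whole model for every candidate, omits the backward pass and the generalized cross-validation criterion, and the name `forward_step` is an assumption of this example.

```python
import numpy as np


def forward_step(X, y, H):
    """Search all (current basis h_l, covariate j, knot t) combinations.

    H is the n x L matrix of basis functions currently in the model
    (here its first column is the intercept). Returns (rss, l, j, t)
    for the reflected pair whose addition gives the smallest residual
    sum of squares.
    """
    n, k = X.shape
    best = None
    for l in range(H.shape[1]):
        for j in range(k):
            for t in np.unique(X[:, j]):
                pos = H[:, l] * np.maximum(X[:, j] - t, 0.0)
                neg = H[:, l] * np.maximum(t - X[:, j], 0.0)
                design = np.column_stack([H, pos, neg])
                coef, *_ = np.linalg.lstsq(design, y, rcond=None)
                rss = float(np.sum((y - design @ coef) ** 2))
                if best is None or rss < best[0]:
                    best = (rss, l, j, float(t))
    return best


# Noise-free toy data with a single kink at x = 0.5.
x = np.linspace(0.0, 1.0, 21)
X = x[:, None]
y = np.maximum(x - 0.5, 0.0)
H = np.ones((21, 1))             # model currently contains only the intercept
rss, l, j, t = forward_step(X, y, H)
print(l, j, t, rss)
```

On these toy data the search recovers the knot at the kink, since only that reflected pair reproduces the response exactly.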

Like  regression  trees,  MARS  is  effectively  performing  variable  selection.  The 
model  and  parameter  estimates  produced  by  MARS  are  also  quite  interpretable 
(unlike  boosting  and  random  forests,  which  we  will  meet  shortly)  though  inference, 
as  with  regression  trees,  is  not  straightforward. 

MARS  is  particularly  appealing  as  the  dimensionality  of  the  covariate  space 
increases  since,  as  we  saw  in  Sect.  12.3.3,  the  use  of  tensor  products  of  splines 
is  prohibitive  in  higher  dimensions,  as  the  number  of  bases  explodes.  MARS  avoids 
this  problem  by  adaptively  choosing  bases  and  then  carrying  out  a  kind  of  pruning. 
The  manner  in  which  the  terms  are  added  has  a  flavor  of  following  the  hierarchy 
principle  (Sect.  4.8)  since  interaction  terms  are  added  on  top  of  the  main  effects.  A 
more  detailed  discussion  of  MARS  can  be  found  in  Hastie  et  al.  (2009,  Sect.  9.4). 


Example:  Ethanol  Data 

We applied both regression trees and the MARS approach to the ethanol data. The pruned regression tree had five splits with four involving the E variable. The resultant fitted surface is shown in Fig. 12.12(a) with the discreteness being apparent and undesirable. Applying the MARS method to the ethanol data results in six bases in the model (in addition to the intercept), with four of the six involving the E variable. The resultant fitted surface is shown in Fig. 12.12(b) and is far more visually appealing than the regression tree surface.


12.8  Classification 

The classification problem is to predict the class of a response, given covariates $x$, from the set $\{0, 1, \dots, K-1\}$. The true outcome is denoted $Y$ and the classification $\hat{Y} = g(x)$, where $g(\cdot)$ is the classifier. There are many approaches to classification, and we will only scratch the surface in this section. More extensive treatments are referenced in Sect. 12.10.

We distinguish two broad approaches. In the first, we fit (or train) a model $Y \mid x$, with, for example, $E[Y \mid x] = f(x)$ for a class of functions $f(\cdot)$. A classification is then made on the basis of the fitted model. The spline and kernel generalized linear model methods discussed in Sect. 11.5 are clearly applicable, if we model the data as multinomial. For example, logistic smoothers may be used in the binary case. The fitted values $\hat{f}(x)$ can be simply converted to classifications $g(x)$, for example, using the Bayes classifier that assigns an observation to the class with the highest probability (Sect. 10.3.2).

In the second approach, we reverse the conditioning and model $X \mid y$. Suppose initially that for class $k$ the distribution of $x$ is known with the prior probabilities of class $k$ being $\pi_k$, $k = 0, 1, \dots, K-1$. Then the posterior probability that a case with covariates $x$ is of class $k$ is

$$\Pr(Y = k \mid x) = \frac{p_k(x) \times \pi_k}{\sum_{l=0}^{K-1} p_l(x) \times \pi_l}, \qquad (12.21)$$

where $p_k(x)$ is the distribution of $x$ for class $k$. Given class probabilities, we wish to decide upon a classification. Minimization of the expected prediction error (which is the expected loss with equal misclassification losses) gives the classifier that assigns the class that maximizes the posterior probability (Sect. 10.4.2).

We may draw an analogy with Bayes model selection, as described in Sect. 4.3.1. For simplicity, suppose we have to decide between just two actions: in the model selection context, two models, and in the classification context, two classes. In the former, model $M_1$ is preferred if

$$\frac{\Pr(M_1 \mid y)}{\Pr(M_0 \mid y)} = \frac{p(y \mid M_1)}{p(y \mid M_0)} \times \frac{\pi_1}{\pi_0} > \frac{L_{\mathrm{I}}}{L_{\mathrm{II}}},$$

where $L_{\mathrm{I}}$ and $L_{\mathrm{II}}$ are the losses associated with type I and type II errors and $\pi_0$ and $\pi_1$ are the prior probabilities of $M_0$ and $M_1$. In the classification context, we classify to class 1 if

$$\frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)} = \frac{p(x \mid Y = 1)}{p(x \mid Y = 0)} \times \frac{\pi_1}{\pi_0} > \frac{L(0, 1)}{L(1, 0)}, \qquad (12.22)$$

where $L(0, 1)$ is the loss associated with assigning $g(x) = 1$ when the truth is $Y = 0$, $L(1, 0)$ is the loss associated with assigning $g(x) = 0$ when the truth is $Y = 1$, and $\pi_0$ and $\pi_1$ are the prior probabilities of $Y = 0$ and $Y = 1$ (Table 10.1).
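The posterior (12.21) and the two-class rule (12.22) can be sketched in a few lines. The normal class densities, the priors, and the function names below are illustrative assumptions, not quantities taken from the text.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, densities, priors):
    """Posterior class probabilities (12.21) from class densities and priors."""
    weighted = [p(x) * pi for p, pi in zip(densities, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def classify_binary(x, densities, priors, loss_01=1.0, loss_10=1.0):
    """Rule (12.22): assign class 1 when the posterior odds exceed L(0,1)/L(1,0)."""
    post = posterior(x, densities, priors)
    return int(post[1] / post[0] > loss_01 / loss_10)


# Hypothetical setting: class 0 is N(0, 1), class 1 is N(2, 1), equal priors.
densities = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 2.0, 1.0)]
priors = [0.5, 0.5]
print(posterior(1.5, densities, priors))
print(classify_binary(1.5, densities, priors))                # equal losses
print(classify_binary(1.5, densities, priors, loss_01=5.0))   # false positives costly
```

At $x = 1.5$ the posterior odds in this toy setting equal $e \approx 2.72$, so the case is assigned to class 1 under equal losses but to class 0 once a false positive is taken to be five times as costly as a false negative.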

Now that we have briefly described the two basic approaches, we outline the structure of this section. In Sect. 12.8.1, we briefly describe a multinomial version of the logistic model that may be used for more than $K = 2$ categories. We then proceed to describe two methods for modeling $p(X \mid y)$, linear and quadratic discriminant analysis in Sect. 12.8.2 and kernel density estimation in Sect. 12.8.3. The former is a parametric method based on normal distributions for the distribution of $X \mid y$, and the latter is nonparametric. Turning to the approach of directly modeling $Y \mid x$, we describe classification trees, bagging, and random forests in, respectively, Sects. 12.8.4-12.8.6.


12.8.1  Logistic  Models  with  K  Classes 


We describe extensions to logistic regression modeling when $K > 2$, and the categories are nominal, that is, have no ordering. For $K$ classes we may specify the model in terms of $K - 1$ odds where, for simplicity, we assume univariate $x$:

$$\frac{\Pr(Y = k \mid x)}{\Pr(Y = K-1 \mid x)} = \exp(\beta_{0k} + \beta_{1k} x), \qquad (12.23)$$

to give

$$\Pr(Y = k \mid x) = \frac{\exp(\beta_{0k} + \beta_{1k} x)}{1 + \sum_{l=0}^{K-2} \exp(\beta_{0l} + \beta_{1l} x)}, \qquad k = 0, \dots, K-2, \qquad (12.24)$$

with $\Pr(Y = K-1 \mid x) = 1 - \sum_{k=0}^{K-2} \Pr(Y = k \mid x)$ (Exercise 12.3). The use of the last category as reference is arbitrary, and the particular category chosen makes no difference for inference. If we do wish to interpret the parameters, then $\exp(\beta_{0k})$ is the baseline probability of $Y = k$, relative to the probability of the final category, $Y = K-1$, so that we have a specific generalization of odds. The parameter $\exp(\beta_{1k})$ is the odds ratio that gives the multiplicative change associated with a one-unit increase in $x$ in the odds of response $k$ relative to the odds of response $K-1$. We emphasize that in a classification (prediction) setting, we will often have little interest in the model coefficients.
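The mapping from the $K - 1$ linear predictors to class probabilities in (12.24) can be illustrated as follows. The coefficient values and the function name `multinomial_probs` are hypothetical.

```python
import numpy as np


def multinomial_probs(x, beta0, beta1):
    """Class probabilities under (12.24) for scalar x.

    beta0 and beta1 hold the K-1 intercepts and slopes for classes
    0, ..., K-2; the reference class K-1 has its linear predictor
    fixed at zero.
    """
    eta = np.asarray(beta0, dtype=float) + np.asarray(beta1, dtype=float) * x
    expo = np.exp(eta)
    denom = 1.0 + expo.sum()
    return np.append(expo / denom, 1.0 / denom)


beta0 = [0.2, -0.5]   # hypothetical coefficients for K = 3 classes
beta1 = [1.0, 0.3]
p = multinomial_probs(0.7, beta0, beta1)
print(p, p.sum())
```

The last entry is the reference-class probability, and dividing any other entry by it recovers $\exp(\beta_{0k} + \beta_{1k} x)$, as in (12.23).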

In terms of nonparametric modeling, one may model the collection of $K - 1$ logits as smooth functions, along with a multinomial likelihood. For example, a simple model is of the form

$$\log \frac{\Pr(Y = k \mid x)}{\Pr(Y = K-1 \mid x)} = f_k(x)$$

with smoothers (such as splines or local polynomials) $f_k(\cdot)$, $k = 0, \dots, K-2$. Yee and Wild (1996) describe how GAMs can be extended to this situation using penalized spline models and also describe the extension to ordered classes.


12.8.2  Linear  and  Quadratic  Discriminant  Analysis 


If we wish to follow the approach summarized in (12.21), then a key element is clearly the specification of the distribution of the covariates for each of the different classes. In this section we assume these distributions are multivariate normal. In a slight change of notation from previous sections, we assume the dimensionality of $x$ is $p$. We begin by assuming that $X \mid y = k \sim N_p(\mu_k, \Sigma)$ so that the $p \times p$ covariance matrix is common to all classes. The within-class distribution of covariates is therefore

$$p_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left[ -\frac{1}{2} (x - \mu_k)^{\mathsf T} \Sigma^{-1} (x - \mu_k) \right]$$

for $k = 0, 1, \dots, K-1$. From (12.21) we see that maximizing $\Pr(Y = k \mid x)$ over $k$ is equivalent to minimizing $-\log \Pr(Y = k \mid x)$, i.e., minimizing

$$(x - \mu_k)^{\mathsf T} \Sigma^{-1} (x - \mu_k) - 2 \log \pi_k, \qquad (12.25)$$

where the first term is the Mahalanobis distance (Mahalanobis 1936) between $x$ and $\mu_k$. If the prior is uniform over $0, 1, \dots, K-1$, then we pick the class that minimizes the within-class sum of squares. Expanding the square in (12.25), it is clear that the term $x^{\mathsf T} \Sigma^{-1} x$, which depends on $\Sigma$ and not $k$, can be ignored. Consequently, we see that the above rule is equivalent to picking, for fixed $x$, the class $k$ that minimizes

$$a_k + x^{\mathsf T} b_k \qquad (12.26)$$

where

$$a_k = \mu_k^{\mathsf T} \Sigma^{-1} \mu_k - 2 \log \pi_k, \qquad b_k = -2 \Sigma^{-1} \mu_k,$$

for $k = 0, \dots, K-1$. Hence, we have a set of $K$ linear planes, and for any $x$, we pick the class $k$ whose plane at that $x$ is a minimum. Said another way, we have a decision boundary that is linear in $x$, and the method is therefore known as linear discriminant analysis (LDA). The decision boundary between classes $k$ and $l$, that is, where $\Pr(Y = k \mid x) = \Pr(Y = l \mid x)$, is linear in $x$ and the regions in $\mathbb{R}^p$ that are classified according to the different classes, $0, 1, \dots, K-1$, are separated by hyperplanes. An example of the linear boundaries, in the case of univariate $x$, is given in Fig. 12.15.

The parameters of the normal distributions are, of course, unknown and may be estimated from the training data via MLE:

$$\hat{\pi}_k = \frac{n_k}{n} \qquad (12.27)$$

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i \qquad (12.28)$$

$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=0}^{K-1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\mathsf T} \qquad (12.29)$$

where $n_k$ is the number of observations with $Y = k$, $k = 0, 1, \dots, K-1$, and $n = \sum_k n_k$. Estimating $\pi_k$ from the data, as in (12.27), relies on a random sample of observations having been taken. Otherwise, we might use prior information to specify class probabilities.
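The estimators (12.27)-(12.29) and the linear rule (12.26) can be sketched with NumPy. The simulated two-class data and the function names are assumptions made for illustration; the check at the end confirms that the linear scores differ from criterion (12.25) only by the term $x^{\mathsf T} \Sigma^{-1} x$, which does not depend on $k$.

```python
import numpy as np


def fit_lda(X, y, K):
    """Estimate class priors, class means and the pooled covariance matrix."""
    n, p = X.shape
    priors = np.array([(y == k).mean() for k in range(K)])
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])
    pooled = np.zeros((p, p))
    for k in range(K):
        resid = X[y == k] - means[k]
        pooled += resid.T @ resid
    return priors, means, pooled / (n - K)


def lda_scores(x, priors, means, pooled):
    """Linear scores a_k + x^T b_k of (12.26); the smallest score wins."""
    prec = np.linalg.inv(pooled)
    a = np.einsum("kp,pq,kq->k", means, prec, means) - 2.0 * np.log(priors)
    b = -2.0 * means @ prec          # row k holds b_k^T
    return a + b @ x


def mahalanobis_scores(x, priors, means, pooled):
    """Criterion (12.25) evaluated for every class."""
    prec = np.linalg.inv(pooled)
    d = x - means
    return np.einsum("kp,pq,kq->k", d, prec, d) - 2.0 * np.log(priors)


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(40, 2)), rng.normal(size=(30, 2)) + 2.0])
y = np.repeat([0, 1], [40, 30])
priors, means, pooled = fit_lda(X, y, K=2)
x_new = np.array([1.0, 1.0])
lin = lda_scores(x_new, priors, means, pooled)
quad = mahalanobis_scores(x_new, priors, means, pooled)
print(int(np.argmin(lin)), int(np.argmin(quad)))
```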

We now relax the assumption that the covariance matrices are equal. In this case, we pick $k$ that minimizes

$$\log |\Sigma_k| + (x - \mu_k)^{\mathsf T} \Sigma_k^{-1} (x - \mu_k) - 2 \log \pi_k,$$

as shown by Smith (1947). Expanding the quadratic form gives a term $x^{\mathsf T} \Sigma_k^{-1} x$ which cannot be ignored since the variance-covariance matrix depends on $k$: hence, the method is known as quadratic discriminant analysis (QDA). Again, we need to estimate the parameters, with the estimators for $\pi_k$ and $\mu_k$ corresponding to (12.27) and (12.28) with

$$\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\mathsf T}$$

for $k = 0, 1, \dots, K-1$.

We now examine the connection between logistic regression and LDA in the case of two classes, that is, $K = 2$. Under LDA we define the log odds function

$$L(x) = \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)} = \underbrace{\log \frac{\pi_1}{\pi_0} - \frac{1}{2} (\mu_1 + \mu_0)^{\mathsf T} \Sigma^{-1} (\mu_1 - \mu_0)}_{\alpha_0} + \underbrace{(\mu_1 - \mu_0)^{\mathsf T} \Sigma^{-1}}_{\alpha_1} x,$$

so that the function upon which classifications are based has the linear form $\alpha_0 + \alpha_1 x$. If the losses associated with the two types of errors are equal, then we assign a case with covariates $x$ to class $Y = 1$ if $L(x) > 0$, see (12.22). Notice that $x$ enters only through the term $(\mu_1 - \mu_0)^{\mathsf T} \Sigma^{-1} x$. Under (linear) logistic regression,

$$\log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)} = \beta_0 + \beta_1 x.$$

Consequently, the rules are both linear in $x$, but differ in the manner by which the parameters are estimated. In general, we may factor the distribution of $x_i, y_i$ in two ways, which correspond to the two approaches to classification that we have highlighted. Modeling the $x$ distributions corresponds to the factorization

$$\prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(x_i \mid y_i) \prod_{i=1}^{n} p(y_i).$$

For example, under LDA, it is assumed that $p(x_i \mid y_i)$ is normal, and then $\prod_{i=1}^{n} p(x_i \mid y_i)$ is maximized with respect to the parameters of the normals. In contrast, under linear logistic regression the factorization is

$$\prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(y_i \mid x_i) \prod_{i=1}^{n} p(x_i),$$

and we maximize the first term, under the assumption of a linear logistic model, while ignoring the second term. Logistic regression therefore leaves the marginal distribution $p(x)$ unspecified, and so the method is more nonparametric than LDA, which is usually an advantage. Asymptotically, there is a 30% efficiency loss when the data are truly multivariate normal but are analyzed via the logistic regression formulation (Efron 1975).

The original derivation of LDA in Fisher (1936) was somewhat different to the presentation given above and was carried out for $K = 2$. Specifically, a linear combination $a^{\mathsf T} x$ was sought that separated (or discriminated between) the classes as much as possible, to "maximize the ratio of the difference between the specific means to the standard deviations" (Fisher 1936, p. 466). This ratio is maximized by taking $a \propto \Sigma^{-1} (\mu_1 - \mu_0)$, an expression we have already seen; see Exercise 12.4 for further detail.


Example:  Bronchopulmonary  Dysplasia 

We return to the BPD example and classify individuals on the basis of their birth weight using linear and quadratic logistic models, and linear and quadratic discriminant analysis. We emphasize that these rules are relevant to the sampled population of children for whom data were collected and not to the general population of newborn babies. This is important since this is far from a random sample, and so the estimate of the probability of BPD (the outcome of interest) is a serious overestimate. In general this example is illustrative of techniques rather than substantively of interest, not least because of the lack of other covariates that one would wish to base a classification rule upon (including medications used by the mother).

Fig. 12.13 Logistic linear and quadratic fits to the BPD and birth weight data. The horizontal line indicates $p(x) = 0.5$, and the two vertical lines correspond to the linear and quadratic logistic decision rules

Figure  12.13  shows  the  linear  and  quadratic  logistic  regression  fits  as  a  function 
of  x.  The  horizontal  p(x)  =  0.5  line  is  drawn  in  gray,  and  we  see  little  difference 
between  the  classification  rules  based  on  the  two  logistic  models.  The  birth  weight 
thresholds  below/above  which  individuals  would  be  classified  as  BPD/not  BPD,  for 
the  linear  and  quadratic  models,  are  954  g  and  926  g,  respectively.  The  fitted  curves 
are  quite  different  in  the  tails,  however.  In  particular,  we  see  that  the  quadratic  curve 
seems  to  move  toward  a  nonzero  probability  for  higher  birth  weights,  a  feature  we 
have  seen  in  other  analyses.  For  example,  the  penalized  cubic  splines  and  local 
likelihood  fits  shown  in  Fig.  11.16  display  this  behavior.  The  similarity  between 
the  classification  rules,  even  though  the  models  are  quite  different,  illustrates  that 
prediction  is  a  different  enterprise  to  conventional  modeling. 

Turning to a discriminant analysis approach, Figs. 12.14(a) and (b) display normal QQ plots (Sect. 5.11.3) of the birth weights for the BPD = 0 and BPD = 1 groups, respectively. The babies with BPD in particular have birth weights which do not appear normal.

The parameter estimates for LDA and QDA are

$$\begin{aligned}
\hat{\pi}_0 &= 0.66 \\
\hat{\mu}_0 &= 1{,}287, \quad \hat{\mu}_1 = 953 \\
\hat{\Sigma} &= 76{,}309, \quad \hat{\Sigma}_0 = 77{,}147, \quad \hat{\Sigma}_1 = 74{,}677.
\end{aligned}$$

Fig. 12.14 (a) Normal QQ plot of birth weights for the BPD = 0 group, (b) normal QQ plot of birth weights for the BPD = 1 group

Crucially,  we  see  that  the  variances  within  the  two  groups  (not  BPD/BPD)  are  very 
similar  so  that  we  would  expect  LDA  and  QDA  to  give  very  similar  answers  in  this 
example.  This  is  indeed  the  case  as  the  linear  and  quadratic  discriminant  boundaries 
are  at  birth  weights  of  970  and  972  g,  respectively. 
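As a rough check on these boundaries, the reported (rounded) estimates can be plugged into the univariate LDA and QDA rules. The sketch below assumes $\hat{\pi}_1 = 1 - \hat{\pi}_0 = 0.34$ and works only from the rounded values, so it should be expected to land close to, but not exactly on, the 970 g and 972 g quoted in the text.

```python
import numpy as np

# Rounded estimates reported for the BPD data; pi1 = 1 - pi0 is an assumption.
pi0, pi1 = 0.66, 0.34
mu0, mu1 = 1287.0, 953.0
s2, s2_0, s2_1 = 76309.0, 77147.0, 74677.0

# LDA: the boundary solves a_0 + x b_0 = a_1 + x b_1, with a_k and b_k as in (12.26).
a0, b0 = mu0 ** 2 / s2 - 2.0 * np.log(pi0), -2.0 * mu0 / s2
a1, b1 = mu1 ** 2 / s2 - 2.0 * np.log(pi1), -2.0 * mu1 / s2
lda_boundary = (a1 - a0) / (b0 - b1)

# QDA: equating log(s2_k) + (x - mu_k)^2 / s2_k - 2 log(pi_k) for k = 0, 1
# gives a quadratic equation in x; keep the root lying between the class means.
coef = [
    1.0 / s2_0 - 1.0 / s2_1,
    -2.0 * mu0 / s2_0 + 2.0 * mu1 / s2_1,
    mu0 ** 2 / s2_0 - mu1 ** 2 / s2_1 + np.log(s2_0 / s2_1) - 2.0 * np.log(pi0 / pi1),
]
roots = np.roots(coef)
qda_boundary = [r.real for r in roots if abs(r.imag) < 1e-9 and mu1 < r.real < mu0][0]

print(round(lda_boundary, 1), round(qda_boundary, 1))
```

Both boundaries fall a little below the midpoint of the two class means (1,120 g) being shifted further down, toward the BPD mean, because the prior probability of no disease is the larger of the two.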

Figure 12.15 gives the lines that are proportional to $-2 \log \Pr(Y = k \mid x)$ for $k = 0, 1$ (with $x$ the birth weight here), that is, the lines given by (12.26). The crossover point gives the birth weight at which we switch from a classification of $Y = 1$ to a classification of $Y = 0$. Figure 12.16 shows the fitted normals under the model with differing variances.

There  are  only  small  differences  in  this  example,  because  the  within-class  birth 
weights  are  not  too  far  from  normal,  the  variances  in  each  group  are  approximately 
equal,  and  the  sample  sizes  are  relatively  large. 


12.8.3  Kernel  Density  Estimation  and  Classification 

We now describe a nonparametric method for classification based on kernel density estimation (Sect. 11.3.2). With estimated densities $\hat{p}_k(x)$, the classification is based on

$$\widehat{\Pr}(Y = k \mid x) = \frac{\hat{p}_k(x) \times \hat{\pi}_k}{\sum_{l=0}^{K-1} \hat{p}_l(x) \times \hat{\pi}_l}.$$

When classification is the goal, then effort should be concentrated on estimating the class probabilities $\Pr(Y = k \mid x)$ accurately near the decision boundary. As we saw in Sect. 11.3.2, the crucial aspect of kernel density estimation is an appropriate choice of smoothing parameter with the form of the kernel being usually unimportant.
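A one-dimensional version of this classifier can be sketched with a Gaussian kernel. The simulated samples, the bandwidths, and the function names are assumptions of this example; class priors are estimated by the sample proportions, as in (12.27).

```python
import numpy as np


def kde_gaussian(x_grid, data, bandwidth):
    """Kernel density estimate with a Gaussian kernel and the given bandwidth."""
    x_grid = np.atleast_1d(np.asarray(x_grid, dtype=float))
    data = np.asarray(data, dtype=float)
    z = (x_grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * bandwidth * np.sqrt(2.0 * np.pi))


def kde_posterior(x_grid, samples, bandwidths):
    """Estimated Pr(Y = k | x) on a grid; row k corresponds to class k."""
    n_total = sum(len(s) for s in samples)
    weighted = np.array([
        (len(s) / n_total) * kde_gaussian(x_grid, s, h)
        for s, h in zip(samples, bandwidths)
    ])
    return weighted / weighted.sum(axis=0)


rng = np.random.default_rng(0)
class0 = rng.normal(3.0, 1.0, size=100)   # simulated covariate values, class 0
class1 = rng.normal(0.0, 1.0, size=50)    # simulated covariate values, class 1
post = kde_posterior([-1.0, 1.5, 4.0], [class0, class1], [0.4, 0.4])
print(post.round(3))
```

Each class may have its own bandwidth, which is the situation considered in the example that follows in the text.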

Kernel density estimation is hard when the dimensionality $p$ of $x$ is large. The naive Bayes method assumes that, given a class $Y = l$, the random variables $X_1, \dots, X_p$ are independent to give joint distribution

$$p_l(x) = \prod_{j=1}^{p} p_{lj}(x_j). \qquad (12.30)$$

The naive Bayes method is clearly based on heroic assumptions but is also clearly simple to apply since one need only compute $p$ univariate kernel density estimates for each class. An additional advantage of the method is that elements of $x$ that are discrete may be estimated using histograms, allowing the simple combination of continuous and discrete variables. Taking the logit transform of (12.30), as described in Sect. 12.8.1, we obtain

$$\begin{aligned}
\log \frac{\Pr(Y = l \mid x)}{\Pr(Y = K-1 \mid x)} &= \log \frac{\pi_l\, p_l(x)}{\pi_{K-1}\, p_{K-1}(x)} \\
&= \log \frac{\pi_l \prod_{j=1}^{p} p_{lj}(x_j)}{\pi_{K-1} \prod_{j=1}^{p} p_{K-1,j}(x_j)} \\
&= \log \frac{\pi_l}{\pi_{K-1}} + \sum_{j=1}^{p} \log \frac{p_{lj}(x_j)}{p_{K-1,j}(x_j)},
\end{aligned}$$

which has the form of a GAM (Sect. 12.2) and provides an alternative method of estimation to kernel density estimation under the assumption of independence of elements of $x$ in different classes. The same form of decision rule arising via two separate estimation approaches is similar to that seen when comparing LDA and logistic regression.

Fig. 12.15 Linear discriminant boundaries (12.26) for the two BPD groups with $k = 0/1$ representing no disease/disease; the rule is based on whichever of the two lines is lowest. The vertical line is the decision boundary so that to the left of this line the classification is to disease and to the right, to no disease

Fig. 12.16 For the BPD data, fitted normal distributions with different variances for each of the two classes (i.e., $\Sigma_0 \neq \Sigma_1$) and with areas proportional to $\pi_1$ and $\pi_0$ (for the left and right normals, respectively); the dashed line represents the quadratic discrimination rule and corresponds to the crossover point of the two densities. The dashes on the top and bottom axes represent the observed birth weights for those babies with and without BPD


Example:  Bronchopulmonary  Dysplasia 


We illustrate the use of kernel density estimation in a one-dimensional setting using the birth weight/BPD example. The choice of smoothing parameter $\lambda$ is crucial, and we present three different analyses based on different methods. We let $\lambda_k$ represent the smoothing parameter under classification $k$, $k = 0, 1$.

First, we use the optimal $\lambda_k$, given by (11.36), that arises under the assumption that each of the densities is normal. This leads to estimates of $\lambda_0 = 108$ and $\lambda_1 = 122$. Figure 12.17(a) shows the estimated densities for both classes. The non-normality of birth weights for the $k = 0$ class, that was previously seen in Fig. 12.14(a), is evident. The log ratio,

$$\log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = 0 \mid x)} = \log \frac{\hat{p}_1(x)}{\hat{p}_0(x)} + \log \frac{\hat{\pi}_1}{\hat{\pi}_0},$$

is shown in panel (b) and gives a decision threshold of $x = 1{,}162$ g.

Fig. 12.17 The left column shows kernel density estimates for the birth weights under the two classes (with $k = 0/1$ corresponding to no BPD/BPD) and the right column the log of the ratio $\Pr(Y = 1 \mid x)/\Pr(Y = 0 \mid x)$ with the vertical line indicating the decision threshold. The three rows correspond to choosing the smoothing parameters based on normality of the underlying densities, cross-validation, and upon a plug-in method

We next use cross-validation to pick the smoothing parameters (as described in Sect. 11.3.2). The resultant estimates are $\lambda_0 = 103$ and $\lambda_1 = 28$ with the resultant density estimates plotted in Fig. 12.17(c). The estimate for the disease group ($k = 1$) is very unsatisfactory, though the decision boundary (as shown in Fig. 12.17(d)) is very similar to the previous approach, with a threshold of $x = 1{,}083$ g.

Finally, we use the plug-in method of Sheather and Jones (1991) to pick the smoothing parameters, giving $\lambda_0 = 97$ and $\lambda_1 = 59$. The resultant density estimates are plotted in Fig. 12.17(e). The birth weight threshold is $x = 1{,}102$ g under these smoothing parameters with the log ratio, shown in Fig. 12.17(f), being smoother than the cross-validation version but less smooth than the normal version.


12.8.4  Classification  Trees 

In  this  section  we  consider  how  the  regression  trees  described  in  Sect.  12.7  can 
be  used  in  a  classification  context.  Classification  and  regression  trees,  or  CART,  has 
become  a  generic  term  to  describe  the  use  of  regression  trees  and  classification  trees. 

In the classification setting, the criterion for splitting nodes needs refinement. For regression, we used the residual sum of squares within each node as the impurity measure $S_j(T)$, defined in (12.18). This measure was then used within the cost complexity criterion, (12.19), to give a penalized sum of squares function to minimize. A sum of squares is not suitable for classification, however (for a variety of reasons, including the nonconstant variance aspect of discrete outcomes). In order to define an impurity measure, we need to specify, for each of the $J$ terminal nodes (leaves), a probability distribution over the $K$ outcomes. Node $j$ represents a region $R_j$ with $n_j$ observations, and the obvious estimate of the probability of observing class $k$ at node $j$ is

$$\hat{p}_{jk} = \frac{1}{n_j} \sum_{i: x_i \in R_j} I(y_i = k),$$

which is simply the proportion of class $k$ observations in node $j$, for $k = 0, \dots, K-1$, $j = 1, \dots, J$. We may classify the observations in node $j$ to class

$$k(j) = \arg\max_k \hat{p}_{jk},$$

the majority class (Bayes rule) at node $j$. Given a set of classification probabilities, we turn to defining a measure of impurity. In a regression setting, we wished to find regions of the $x$ space within which the response was relatively constant, and the impurity measure in this setting was the residual sum of squares about the mean of the terminal node in question. By analogy, we would like the leaves in a classification setting to contain observations of the same class. An impurity measure should therefore be 0 if all the probability at a node is concentrated on one class, that is, if $\hat{p}_{jk} = 1$ for some $k$, and the measure should achieve a maximum if the probability is spread uniformly across the classes, that is, if $\hat{p}_{jk} = 1/K$ for $k = 0, \dots, K-1$. Three different impurity measures are discussed by Hastie et al. (2009, Sect. 9.2.3).

The misclassification error of node $j$ is the proportion of observations at node $j$ that are misclassified:

$$\frac{1}{n_j} \sum_{i: x_i \in R_j} I[y_i \neq k(j)] = 1 - \hat{p}_{j,k(j)}.$$

The Gini index associated with node $j$ is

$$\sum_{k \neq k'} \hat{p}_{jk}\, \hat{p}_{jk'} = \sum_{k=0}^{K-1} \hat{p}_{jk} (1 - \hat{p}_{jk}).$$

The Gini index has an interesting interpretation. Instead of assigning observations to the majority class at a node, we could assign to class $k$ with probability $\hat{p}_{jk}$. With such an assignment, the training error of the rule at the node is

$$\sum_{k=0}^{K-1} \Pr(\text{Truth} = k) \times \Pr(\text{Classify} \neq k) = \sum_{k=0}^{K-1} \hat{p}_{jk} (1 - \hat{p}_{jk}),$$

which is the Gini index. It may be better to use this than the misclassification error because it "has an element of look ahead" (Ripley 1996, p. 327), that is, it considers the error in a hypothetical training dataset. The final measure is the deviance, which is just the negative log-likelihood of a multinomial:

$$-\sum_{k=0}^{K-1} \hat{p}_{jk} \log \hat{p}_{jk}.$$

This measure is also known as the entropy.⁴ The deviance and Gini index are differentiable and hence more amenable to numerical optimization.

For two classes, let $p_j$ be the proportion in the second class at node $j$, for $j = 1, \dots, J$. In this case, the misclassification error, Gini index, and deviance measures are, respectively,

$$\begin{aligned}
& 1 - \max(p_j, 1 - p_j) \\
& 2 p_j (1 - p_j) \\
& -p_j \log p_j - (1 - p_j) \log(1 - p_j).
\end{aligned}$$

The worst scenario is $p_j = 0.5$ since we have a 50:50 split of the two classes in the partition (and hence the greatest impurity). Figure 12.18 graphically compares the three measures, with the deviance scaled to pass through the same apex point as the other two measures.
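The three impurity measures are simple functions of the vector of class proportions at a node. The sketch below uses illustrative function names and treats $0 \log 0$ as 0 in the entropy, a common convention that is assumed here.

```python
import numpy as np


def misclassification(p):
    """One minus the largest class proportion at the node."""
    return 1.0 - float(np.max(p))


def gini(p):
    """Sum over classes of p_k (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))


def entropy(p):
    """Minus the sum of p_k log p_k, with 0 log 0 taken to be 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))


for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, misclassification(probs), gini(probs), entropy(probs))
```

At a 50:50 node the misclassification error and Gini index both equal 0.5 while the entropy equals $\log 2$, which is why the deviance has to be rescaled before the three curves can share an apex in a plot such as Fig. 12.18.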


⁴ In statistical thermodynamics, the entropy of a system is the amount of uncertainty in that system, with the maximum entropy being associated with a uniform distribution over the states.

Fig. 12.18 Comparison of impurity measures for binary classification. The classes are labeled 0 and 1 and $p_j$ is the proportion in the second class (i.e., $k = 1$) at node $j$


12.8.5  Bagging 

We  previously  noted  in  Sect.  12.7.1  that  regression  trees  can  be  unstable,  in  the 
sense  that  a  small  change  in  the  learning  data  can  induce  a  large  change  in  the 
prediction/classification.  Classification  trees  can  also  produce  poor  results  when 
there  exist  heterogeneous  terminal  nodes,  or  highly  correlated  predictors. 

Bootstrap  aggregation  or  bagging  (Breiman  1996)  averages  predictions  over 
bootstrap  samples  (Sect.  2.7)  in  order  to  overcome  the  instability.  The  intuition 
is  that  idiosyncratic  results  produced  by  particular  trees  can  be  averaged  away, 
resulting  in  more  stable  estimation.  Although  bagging  is  often  implemented  with 
regression  or  classification  trees,  it  may  be  used  with  more  general  nonparametric 
techniques.  To  demonstrate  the  variability  in  tree  construction;  Figs.  12.19(a)-(c) 
show  three  pruned  trees  based  on  three  bootstrap  samples  for  the  prostate  cancer 
data.  The  first  splits  in  (a)  and  (b)  are  on  log  cancer  volume  but  at  very  different 
points,  while  in  (c),  the  first  split  is  on  SVI.  The  variability  across  bootstrap  samples 
is  apparent. 

As  usual,  let  [x,t. i  =  1 , . . . ,  n  denote  the  data.  The  aim  is  to  form  a 
prediction,  f(x0)  =  E[Y  |  a;o]  at  a  covariate  value  Xq.  Bagging  proceeds  as 
follows: 

1 .  Construct  B  bootstrap  samples 


Wb,vt]  =  {x*bi,ybi,i  =  1,  •  •  ■  ,n}, 
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Fig.  12.19  Three  pruned  trees  for  the  prostate  cancer  data,  based  on  different  bootstrap  samples 


for  b  =  1, . . . ,  B.  The  bootstrap  samples  are  formed  by  resampling  cases 
(Sect.  2.7.2),  that  is,  we  sample  with  replacement  from  [ Xi ,  j/j],  *  =  1, . . . ,  n. 

2. If the outcome is continuous (in which case, we might use regression trees for
prediction), form the averaged prediction

$$\hat{f}_B(x_0) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^\star(x_0), \qquad (12.31)$$

where $\hat{f}_b^\star(x)$ is the prediction constructed from the $b$-th bootstrap sample
$[x_b^\star, y_b^\star]$, $b = 1, \dots, B$. If classification is the aim and classification trees are
constructed from the samples, one may take a majority vote over the $B$ samples
in order to assign a class label.
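
The two bagging steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base learner is a one-split regression tree (a "stump") rather than a fully grown tree, the data are a toy step function, and the names `fit_stump` and `bagged_prediction` are hypothetical helpers introduced here, not taken from any package.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by least squares."""
    pairs = sorted(zip(xs, ys))
    best = None  # (rss, cut, left_mean, right_mean)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between tied covariate values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = mean(left), mean(right)
        rss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    if best is None:  # all covariate values equal: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, cut, ml, mr = best
    return lambda x: ml if x <= cut else mr


def bagged_prediction(xs, ys, x0, B=200, seed=0):
    """Average the stump predictions at x0 over B bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: resample cases
        f_b = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(f_b(x0))
    return mean(predictions)  # step 2: average over the B fits


# Toy data (illustrative only): a step function with a jump at x = 10.
xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]
p_low = bagged_prediction(xs, ys, 2)
p_high = bagged_prediction(xs, ys, 17)
print(p_low, p_high)
```

For classification, the final `mean` would be replaced by a majority vote over the B fitted classifiers.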

If tree methods are used, there is evidence that pruning should not be carried out
(Bauer and Kohavi 1999). By not pruning, more complex models are fitted, which
reduces bias, and since bagging averages over many models, the variance can be
reduced also.

We examine the continuous case in greater detail. Expression (12.31) is a
Monte Carlo estimate of the theoretical bagging estimate $E_{\hat{P}}[\hat{f}^\star(x_0)]$, where the
expectation is with respect to sampling $[x^\star, y^\star]$ from $\hat{P}$, which is the empirical
distribution having probability $1/n$ at $[x_i, y_i]$, $i = 1, \dots, n$.

We now examine the mean squared error in an idealized setting. Let $P$ be the
population from which $[y_i, x_i]$, $i = 1, \dots, n$, are drawn. For analytical simplicity,
suppose we can draw bootstrap samples from the population rather than the
observed data. Let $\hat{f}^\star(x_0)$ be a prediction at $x_0$, based on a sample from $P$, and

$$f_{AGG}(x_0) = E_P[\hat{f}^\star(x_0)]$$

be the ideal bagging estimate which averages the estimator over samples from the
population.

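We consider a decomposition of the MSE of the prediction, in a regression
setting, based on the single sample estimator, $\hat{f}^\star(x_0)$, only:

$$
\begin{aligned}
E_P\{[Y - \hat{f}^\star(x_0)]^2\}
&= E_P\{[Y - f_{AGG}(x_0) + f_{AGG}(x_0) - \hat{f}^\star(x_0)]^2\} \\
&= E_P\{[Y - f_{AGG}(x_0)]^2\} + E_P\{[\hat{f}^\star(x_0) - f_{AGG}(x_0)]^2\} \\
&\quad + 2\, E_P\{[Y - f_{AGG}(x_0)]\}\, E_P\{[f_{AGG}(x_0) - \hat{f}^\star(x_0)]\} \\
&= E_P\{[Y - f_{AGG}(x_0)]^2\} + E_P\{[\hat{f}^\star(x_0) - f_{AGG}(x_0)]^2\} \\
&\geq E_P\{[Y - f_{AGG}(x_0)]^2\}. \qquad (12.32)
\end{aligned}
$$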

Hence, the MSE of idealized population averaging (which is the expression in the
last line) never exceeds the MSE of a prediction based on a single sample. The
cross-product term vanishes because $E_P[\hat{f}^\star(x_0)] = f_{AGG}(x_0)$ by definition. The
second term in (12.32) is the variability of the estimator $\hat{f}^\star(x_0)$ about its average.
The above decomposition is relevant to a regression setting but is not valid for
classification (0-1 loss), and bagging a bad classifier can make it even worse.

Bagging  is  an  example  of  an  ensemble  learning  method;  another  such  method 
that  we  have  already  encountered  is  Bayesian  model  averaging  (Sect.  3.6).  The 
bagged  estimate  (12.31)  will  differ  (in  expectation)  from  the  original  estimate 
f(x 0),  only  when  f(x 0)  is  a  nonlinear  function  of  the  data.  So,  for  example,  bagged 
prediction  estimates  from  spline  and  local  polynomial  models  that  produce  linear 
smoothers  will  be  the  same  as  those  from  fitting  a  single  model  using  the  complete 
data. 

The original motivation for bagging (Breiman 1996) was to reduce variance.
However, bagging can also reduce (or increase!) bias (Bühlmann and Yu 2002). Bias
may be reduced if the true function is smoothly varying and tree models are used
(since the averaging of step functions will produce smoothing). If the true function
is "jaggedy," bias can be introduced through averaging. As just noted, bagging was
originally  designed  to  reduce  variance  and  works  well  in  examples  in  which  the 
data  are  “unstable,”  that  is,  in  situations  in  which  small  changes  in  the  data  can 
cause  large  changes  in  the  prediction.  We  give  an  example  of  a  scenario  in  which 
bagging  can  increase  the  variance.  Suppose  there  is  an  outlying  point  (in  x  space). 
This  point  may  stabilize  the  fit  when  the  model  is  fitted  to  the  complete  data,  and 
if  it  is  left  out  of  a  particular  bootstrap  sample,  the  fitted  values  may  be  much  more 
variable  for  this  sample. 

To bag a tree-based classifier, we first grow a classification tree for each of
the $B$ bootstrap samples. Recall, we may have two different aims: reporting a
classification or reporting a probability distribution over classes. Suppose we require
a classification. The bagged estimate $\hat{f}_B(x)$ is the $K$-vector $[p_0(x), \dots, p_{K-1}(x)]$,
where $p_k(x)$ is the proportion of the $B$ trees that predict class $k$, $k = 0, 1, \dots, K-1$.
The classification is the $k$ that maximizes $p_k(x)$, that is, the Bayes rule. If we
require the class-probability estimates, then we average the underlying functions
that produce the classifications, $g_b(x)$. We should not average the classifications. To
illustrate why, consider a two class case. Each bootstrap sample may predict the 0
class with probability 0.51, and hence the classifier for each would be $Y = 0$, but
we would not want to report the class probabilities as (1, 0).
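
The distinction between averaging classifications and averaging class probabilities can be made concrete with the two class example just given. The sketch below assumes each of 100 hypothetical trees returns the probability vector (0.51, 0.49); the function names are introduced here for illustration only.

```python
def vote_proportions(prob_per_tree):
    """Proportion of trees whose hard classification is each class."""
    K = len(prob_per_tree[0])
    counts = [0] * K
    for probs in prob_per_tree:
        counts[probs.index(max(probs))] += 1  # each tree votes for its modal class
    return [c / len(prob_per_tree) for c in counts]


def averaged_probabilities(prob_per_tree):
    """Average the underlying class-probability estimates over trees."""
    K = len(prob_per_tree[0])
    B = len(prob_per_tree)
    return [sum(probs[k] for probs in prob_per_tree) / B for k in range(K)]


# 100 hypothetical trees, each predicting class 0 with probability 0.51.
trees = [[0.51, 0.49]] * 100
votes = vote_proportions(trees)
avg = averaged_probabilities(trees)
print(votes)                          # every tree votes for class 0
print([round(a, 2) for a in avg])     # averaged probabilities stay near (0.51, 0.49)
```

The vote proportions are appropriate for assigning a label, while the averaged probabilities are the quantity to report when a distribution over classes is required.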

The simple interpretation of trees is lost through bagging, since a bagged tree is
not a tree. For each tree, one may evaluate the test error on the "left-out" samples
(i.e., those not selected in the bootstrap sample). On average, around 1/3 of the data
do not appear in each bootstrap sample, since a given observation is omitted with
probability $(1 - 1/n)^n \approx e^{-1} \approx 0.37$. These data are referred to as the "out-of-bag"
(oob) data. The resulting test error estimates may be combined, removing the need for
a test dataset.
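
The "around 1/3" figure can be checked numerically. The following sketch compares the exact probability that a given case is left out of a bootstrap sample of size n with a small simulation; the function names and the choices n = 500 and B = 200 are assumptions made for illustration.

```python
import math
import random


def oob_fraction_exact(n):
    """Probability that a given case is missing from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


def oob_fraction_simulated(n, B=200, seed=1):
    """Average fraction of cases left out over B simulated bootstrap samples."""
    rng = random.Random(seed)
    left_out = 0
    for _ in range(B):
        in_bag = {rng.randrange(n) for _ in range(n)}  # distinct cases drawn
        left_out += n - len(in_bag)
    return left_out / (B * n)


exact = oob_fraction_exact(500)
simulated = oob_fraction_simulated(500)
print(round(exact, 3), round(simulated, 3), round(math.exp(-1), 3))
```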

Bagging takes the algorithmic approach to classification to another level beyond
tree-based methods and was important historically as it was an intermediate step
toward various other methods including random forests, which we describe next.


12.8.6  Random  Forests 

Random  forests  (Breiman  2001a)  are  a  very  popular  and  easily  implemented 
technique  that  build  on  bagging  by  reducing  the  correlation  between  the  multiple 
trees  that  are  fitted  to  bootstrap  samples  of  the  data. 

The  random  forest  algorithm  is  as  follows: 

1 .  B  bootstrap  samples  of  size  n  are  drawn,  with  replacement,  from  the  original 
data. 

2. Suppose there are $p$ covariates. A number $m \ll p$ is specified, and at each node
$m$ variables are selected at random from the $p$ available. The best split from these
$m$ is used to split the node.

3. Each tree is grown to be large, with no pruning. We emphasize that a different
set of $m$ covariates is selected at each split so the input variables are changing
within each tree.

Fig. 12.20 Out-of-bag error rate as a function of the number of trees, for the outcome after head injury data

Once this process is completed, the output is a collection of $B$ trees. As with
bagging, in a regression context, the prediction may be taken to be the average of the
fits, as in (12.31), while in a classification context the majority vote may be taken.
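
The step that distinguishes a random forest from bagging is the random choice of candidate covariates at each node. A minimal sketch of that single step is given below, assuming a regression setting with residual sum of squares as the split criterion; `best_split_random_subset` and the toy data are hypothetical and chosen only for illustration.

```python
import random


def rss(values):
    """Residual sum of squares about the mean (zero for an empty node)."""
    if not values:
        return 0.0
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values)


def best_split_random_subset(X, y, m, rng):
    """Choose m of the p covariates at random and find the best split among them.

    Returns ((feature_index, cut_point, rss_after_split), candidate_features);
    the first element is None if no candidate covariate can be split."""
    p = len(X[0])
    candidates = rng.sample(range(p), m)
    best = None
    for j in candidates:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= cut]
            right = [yi for row, yi in zip(X, y) if row[j] > cut]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, cut, total)
    return best, candidates


# Toy data: only covariate 0 carries signal; the others are noise-like patterns.
X = [[i, (2 * i) % 5, (3 * i) % 4] for i in range(12)]
y = [1.0 if i >= 6 else 0.0 for i in range(12)]

rng = random.Random(1)
best_all, cand_all = best_split_random_subset(X, y, 3, rng)  # m = p: as in bagging
best_one, cand_one = best_split_random_subset(X, y, 1, rng)  # m = 1: heavily restricted
print(best_all, cand_one, best_one)
```

With m = p the search reduces to the usual tree split, so the informative covariate is always found; with small m the chosen split depends on which covariates happen to be drawn, which is what decorrelates the trees.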

There  are  two  conflicting  aims  when  we  consider  the  size  of  m.  Increasing  the 
correlation  between  any  two  trees  in  the  forest  increases  the  forest  error  rate,  but  the 
forest  error  rate  decreases  as  the  strength  of  each  individual  tree  increases.  Reducing 
m  reduces  both  the  correlation  and  the  strength,  while  increasing  m  increases  both. 
We heuristically explain why reducing the correlation between predictions leads to a
lowering of the forest error rate. Suppose we wish to estimate a prediction at a value
$x_0$ using an average of $B$ predictions. Let $\hat{f}_B(x_0) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^\star(x_0)$ and suppose
that the predictions each have variance $\sigma^2$ and pairwise correlations $\rho$. Then it is
straightforward to show that the variance of the average is

$$\mathrm{var}[\hat{f}_B(x_0)] = \frac{(1 - \rho)\sigma^2}{B} + \rho \sigma^2.$$

The first term decreases to zero as the number of predictor functions increases, while
the second term is a function of the dependence between the functions. Hence, the
closer the predictor functions are to independence, the lower the variance.
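
The variance of an average of $B$ equally correlated predictions follows in a few lines. The sketch below writes $\hat{f}_b$ for the $b$-th prediction at $x_0$ (a shorthand introduced here) and assumes a common variance $\sigma^2$ and a common pairwise correlation $\rho$:

```latex
\begin{aligned}
\mathrm{var}\left( \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b \right)
&= \frac{1}{B^2} \left[ \sum_{b=1}^{B} \mathrm{var}(\hat{f}_b)
   + \sum_{b \neq b'} \mathrm{cov}(\hat{f}_b, \hat{f}_{b'}) \right] \\
&= \frac{1}{B^2} \left[ B \sigma^2 + B(B-1) \rho \sigma^2 \right] \\
&= \frac{(1-\rho)\sigma^2}{B} + \rho \sigma^2 .
\end{aligned}
```

With $\rho = 0$ this reduces to $\sigma^2/B$, while with $\rho = 1$ it equals $\sigma^2$ for every $B$, so averaging perfectly correlated predictions brings no reduction in variance.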

As with bagging, when the training set (i.e., the bootstrap sample) for the current
tree is drawn by sampling with replacement, about 1/3 of the data is left out of the
sample, and these form the oob data (Sect. 12.8.5). These set-aside data are used to
get a running unbiased estimate of the classification error, as trees are added to the
forest. An example of such a plot is given in Fig. 12.20. A typical recommended value
for $m$ is the integer part of $\sqrt{p}$ in a classification setting, and the integer part of
$p/3$ for regression (Hastie et al. 2009, Sect. 15.3), though these values should not be
taken as written in stone, and some experimentation should be performed.

The concept of only taking a subset of variables seems totally alien statistically,
since information is reduced, but the vital observation is that this produces classifiers
that are close to being uncorrelated. This is a key difference with bagging, with
which the random forests method shares many similarities. By injecting randomness
into the algorithm, through the random selection of covariates at each split, the
constituent trees are more independent. Selecting a random set of $m$ covariates also
allows random forests to cope with the situation in which there are more covariates
than observations (i.e., $n < p$).

Fig. 12.21 Random forest variable importance for one split of the outcome after head injury data.
The left panel (MeanDecreaseAccuracy) shows the decrease in predictive ability (as measured by
the misclassification error) when the variable is permuted and the right panel (MeanDecreaseGini)
the decrease in the Gini index when the variable is not included in the classification. In both
panels the variables are listed, from top to bottom, as coma, agec, pup, haem

As  we  have  noted,  random  forests  lose  the  relatively  simple  interpretation  of  tree- 
based  methods.  Although  prediction  is  the  objective  of  random  forests,  it  may  still 
often  be  of  interest  to  see  which  of  the  variables  are  making  contributions  to  the 
overall  prediction  (averaged  over  trees).  If  one  is  interested  in  gaining  this  insight 
into  which  predictors  are  performing  well,  then  two  measures  of  variable  importance 
are  popular.  One  approach  is  to  obtain  the  decrease  in  the  fitting  measure  each  time 
a  particular  variable  is  used.  The  average  of  this  decrease  over  all  trees  can  then  be 
calculated  with  important  variables  having  large  decreases.  For  regression  the  fitting 
measure  is  the  residual  sum  of  squares,  and  for  classification  it  is  often  the  Gini 
index  (Sect.  12.7).  The  right  panel  of  Fig.  12.21  shows  this  average.  This  measure 
seems  intuitively  reasonable,  but,  as  discussed  by  Berk  (2008,  Sect.  5.6.1),  it  has  a 
number  of  drawbacks.  First,  reductions  in  the  fitting  criteria  do  not  immediately 
translate  into  improvements  in  prediction.  Second,  the  decreases  are  calculated 
using  the  data  that  were  used  to  build  the  model  and  not  from  test  data.  Finally,  there 
is  no  absolute  scale  on  which  to  judge  reductions.  As  an  alternative,  one  may,  for 
each  predictor,  calculate  the  error  rate  using  a  random  permutation  of  the  predictor. 
The  difference  between  the  two  is  then  averaged  over  trees.  This  second  approach 
is  more  akin  to  setting  a  coefficient  to  zero  in  a  regression  model  and  then  assessing 
the  reduction  in  predictive  power  (say  in  a  test  dataset).  The  importance  of  each 
variable  is  assessed  by  creating  trees  using  random  permutations  of  the  values  of 
the variable, rather than the variable itself. The predictive power is then assessed
using the "true" variable in the oob data compared to the permuted version. We give
more detail in a classification setting and assuming we measure the predictive power
in terms of the misclassification error rate. Suppose that, for the $b$th tree, this rate is
$v_b$ when using all of the true variables and is $v_{bj}$ when the $j$th variable is shuffled.
Then, the change in the predictive power is summarized as

$$\frac{1}{B} \sum_{b=1}^{B} (v_{bj} - v_b). \qquad (12.33)$$


Note  that  if  the  variable  is  not  useful  predictively,  then  this  measure  might  by  chance 
be  negative.  The  left  panel  of  Fig.  12.21  shows  the  decrease  in  predictive  power  for 
each  of  four  variables. 
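
The permutation measure can be sketched as follows: for each tree, compute the misclassification rate on its oob data, recompute it after shuffling the values of one covariate, and average the differences over trees. The data structure (a list of `(predict, X_oob, y_oob)` triples), the function names, and the toy classifier are all assumptions made for illustration; a real forest would supply a different fitted tree and oob set for each b.

```python
import random


def error_rate(predict, X, y):
    """Misclassification rate of the classifier `predict` on (X, y)."""
    wrong = sum(1 for row, yi in zip(X, y) if predict(row) != yi)
    return wrong / len(y)


def permutation_importance(trees, j, seed=0):
    """Average over trees of (error with covariate j shuffled) - (error with true data)."""
    rng = random.Random(seed)
    diffs = []
    for predict, X_oob, y_oob in trees:
        v_b = error_rate(predict, X_oob, y_oob)
        column = [row[j] for row in X_oob]
        rng.shuffle(column)  # break the link between covariate j and the outcome
        X_perm = [row[:j] + [c] + row[j + 1:] for row, c in zip(X_oob, column)]
        v_bj = error_rate(predict, X_perm, y_oob)
        diffs.append(v_bj - v_b)
    return sum(diffs) / len(diffs)


# Toy setting: the outcome equals covariate 0, and covariate 1 is irrelevant.
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [row[0] for row in X]
toy_tree = lambda row: row[0]      # a "tree" that splits on covariate 0 only
trees = [(toy_tree, X, y)] * 5     # the same tree and oob data reused, for brevity

imp0 = permutation_importance(trees, 0)
imp1 = permutation_importance(trees, 1)
print(imp0, imp1)
```

Shuffling the informative covariate raises the error rate, whereas shuffling the irrelevant one leaves it unchanged; with noisy data the latter difference can be slightly negative by chance.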


Example:  Outcome  After  Head  Injury 

We  now  compare  classification  methods  on  the  head  injury  data  described  in 
Sect.  7.2.1.  The  binary  response  is  outcome  after  head  injury  (dead/alive),  and 
there  are  four  discrete  covariates:  pupils  (good/poor),  coma  score  (depth  of  coma, 
low/high),  hematoma  present  (no/yes),  and  age  (categorized  as  1-25,  26-54,  >  55). 
We  found  in  Sect.  7.6.4  that  these  data  are  explained  by  relatively  simple  models. 
For example, a model with all main effects and three two-way interactions H.P,
H.A, P.A had a deviance of 13.6 on 13 degrees of freedom which indicates a good
fit. The main effects only model has a deviance of 34.1 on 18 degrees of freedom
and an associated p-value of 1.2% so although not a good fit, it is not terrible either.

The  approaches  to  prediction  we  compare  are  the  null  model,  main  effects  only 
model,  subset  selection  over  all  models  using  AIC  and  BIC,  unrestricted  subset 
selection  using  AIC  and  BIC,  classification  trees,  bagging  trees,  and  random  forests. 
We  looked  at  two  versions  of  AIC  and  BIC  with  one  enforcing  the  hierarchy 
principle  and  the  other  not.  The  random  forest  method  used  two  variables  to  split  on 
at  each  node.  In  this  example  there  are  just  four  covariates,  and  the  discrete  nature  of 
these covariates (each with few levels) and the good fit of simple models indicate
that  we  would  not  expect  to  see  great  advantages  in  using  tree-based  methods. 

We  split  the  data  into  training  and  test  datasets  consisting  of  70%  and  30%  of 
the data, respectively. Each of the methods was run 100 times for different splits of
the  data  and  then  the  misclassification  rates  were  recorded,  along  with  the  standard 
deviations  of  these  rates.  The  results  are  given  in  Table  12.1.  The  striking  aspect 
of  this  table  is  the  lack  of  a  clear  winner;  apart  from  the  null  model,  all  methods 
perform  essentially  equally. 

Figure 12.20 shows the oob error rate as a function of the number of trees. We see
that  the  error  rate  stabilizes  at  around  300  trees.  Figure  12.21  shows  two  measures 
of  the  variable  importance  from  one  particular  split  of  the  data  (i.e.,  one  out  of  100). 
For  this  split,  coma  is  the  most  important  variable  for  classification,  with  hematoma 
the least important. These importance measures are in line with the summaries from
the main effects only model presented in Table 12.2. The left-hand panel shows that
replacing the coma score with a permuted version leads to, on average, an increase
in the predictive error rate, as measured by (12.33), of around 10%. In contrast,
replacing the hematoma variable with a permuted variable actually gives a slight
decrease, indicating that this variable is not useful for forecasting the outcome status
(dead/alive) of a child.

Table 12.1 Average test errors over 100 train/test splits of the outcome after head injury data,
along with the standard deviation over these splits

            Null   Main   AIC 1   BIC 1   AIC 2   BIC 2   Tree   Bagging   Ran For
Mean (%)    50.5   26.1   26.1    25.8    26.1    25.9    25.4   25.7      25.6
SD (%)       2.8    2.2    2.1     2.3     2.0     2.4     2.2    2.3       2.2

AIC 1 and BIC 1 enforce the hierarchy principle, while AIC 2 and BIC 2 do not


Table 12.2 Parameter estimates, standard errors, and p-values for the main effects only model
and one split of the outcome after head injury data

          Estimate   Std. Err.   p-value
Haem       0.169     0.194       0.386
Pup        1.26      0.192       <0.0001
Coma      -1.60      0.198       <0.0001
Age 1      0.724     0.209       0.00054
Age 2      2.36      0.303       <0.0001

This example is not typical of classification problems since the number of
predictors is so small. Exercise 12.6 describes a setting that is more usual.


12.9  Concluding  Comments 

In  this  chapter  we  have  discussed  various  nonparametric  methods  for  prediction 
and  classification.  For  exploration  and  description  it  is  clear  that  the  GAM  models 
described  in  Sect.  12.2  are  very  useful.  Formal  inference  requires  more  care  since, 
as  we  have  seen  repeatedly,  the  appropriateness  of  inference  depends  critically 
on  smoothing  parameter  choice.  The  potential  loss  of  efficiency  as  compared  to 
a  parametric  approach  should  also  be  borne  in  mind. 

Classification is a huge topic, and the surface has only been scratched here
with a focus on model-based, as opposed to algorithm-based, techniques. Bagging
and random forests have been included, however, to provide a hint of the
algorithmic approaches that are available. Neural networks (Ripley 1996; Neal 1996),
boosting (Freund and Schapire 1997; Friedman et al. 2000), and support vector
machines (Vapnik 1996) are three popular classification techniques which have
not been discussed. For a very interesting exposition on the algorithmic approach
to regression, see Breiman (2001b) and the accompanying discussion. We have
not considered large datasets in this chapter and in particular have only briefly
discussed the situation in which the sample size is small relative to the number
of available predictors (the "small n, large p problem"). If prediction is the sole
aim, then ensemble methods such as bagging, random forests, and Bayesian model
averaging have been shown to be very powerful.

This chapter has almost exclusively considered frequentist approaches to
prediction and classification (apart from the mixed model approach to fitting GAMs). We
briefly mention some Bayesian approaches. The book-length treatment of Denison
et al. (2002) describes Bayesian analogs of a number of the techniques that we have
discussed including spline and classification models. Bayesian CART models are
described in Chipman et al. (1998). Gaussian process models are an important topic
that is considered by Rasmussen and Williams (2006).


12.10  Bibliographic  Notes 

An  influential  early  work  on  GAMs  is  the  book-length  treatment  of  Hastie  and 
Tibshirani  (1990).  Wood  (2006)  is  an  excellent  mix  of  the  theory  and  practice  of 
using  GAMs,  with  an  emphasis  on  thin  plate  regression  splines.  Ruppert  et  al.  (2003) 
also consider GAMs from a mixed model standpoint. Natural thin plate splines are
described in Wahba (1990) and Green and Silverman (1994, Chap. 7).

Early  references  to  tree-based  strategies  include  Morgan  and  Sonquist  (1963), 
Morgan  and  Messenger  (1973),  and  Friedman  (1979).  Approaches  based  on  trees 
were  expanded  and  popularized  in  Breiman  et  al.  (1984).  Ripley  (1996),  Izenman 
(2008),  and  Berk  (2008)  describe  machine  learning  techniques  from  a  statistical 
perspective.  Hastie  et  al.  (2009)  is  a  broad  and  in-depth  treatment. 


12.11  Exercises 

12.1 For model (12.12), form and graphically display (via perspective plots) the
16 tensor product basis functions with $L_1 = L_2 = 2$, $\xi_{11} = \xi_{21} = 1/3$,
$\xi_{12} = \xi_{22} = 2/3$.

12.2  For  the  ethanol  data  in  the  R  package  SemiPar,  fit  a  tensor  product  cubic 
spline  model. 

12.3  Show  that  (12.24)  follows  from  (12.23). 

12.4 The background to this question on discriminant analysis can be found in
Sect. 12.8.2. Suppose that under the two classes, $X_0 \sim N_p(\mu_0, \Sigma)$ and,
independently, $X_1 \sim N_p(\mu_1, \Sigma)$. Consider the statistic

$$\frac{\{E[a^T X_0] - E[a^T X_1]\}^2}{\mathrm{var}(a^T X_0 - a^T X_1)}$$

as a function of the $p \times 1$ vector $a$. Show that $a \propto \Sigma^{-1}(\mu_0 - \mu_1)$ maximizes
the statistic, using a Lagrange multiplier approach. Explain why this result
provides one justification for the use of linear discriminant analysis.

12.5  In  this  question,  the  famous  iris  data  analyzed  in  Fisher  (1936)  will  be 
considered.  These  data  may  be  found  at  the  book  website  and  contain  three 
classes  of  iris  (Setosa,  Versicolour,  and  Virginica)  and  four  covariates  (sepal 
length,  sepal  width,  petal  length,  and  petal  width)  all  measured  in  cm. 

(a)  Based  on  the  full  data  for  Setosa  and  Versicolour  only,  build  classifiers 
based  on  the  approaches  listed  below.  In  each  case,  explain  carefully  how 
you  implemented  the  approach,  and  provide  graphical  summaries  of  the 
output. 

(1) Linear discriminant analysis.

(2)  Quadratic  discriminant  analysis. 

(3)  Linear  logistic  regression. 

(4)  Classification  trees. 

(5)  Bagging. 

(6)  Random  forests. 

(b)  Repeat  the  previous  part  for  the  data  on  all  three  classes  of  iris. 

12.6  At  the  book  website  of  Hastie  et  al.  (2009),  you  will  find  data  that  have 
been  extensively  used  to  test  binary  classification  methods.  The  data  concern 
4601  emails,  and  the  aim  is  to  predict  which  are  spam,  in  order  to  filter  out 
such  emails.  There  are  1813  spam  messages  and  57  potential  predictors  that 
concern  the  content  of  the  emails.  The  data  have  been  split  into  a  training 
set  of  3065  emails,  with  1536  remaining  for  testing  the  models.  Following 
Hastie  et  al.  (2009),  analyze  these  data  using  linear  logistic  regression,  a  GAM 
with  splines  having  fixed  degrees  of  freedom  equal  to  3  for  each  smoother, 
classification  trees,  bagging,  and  random  forests.  Summarize  your  findings 
based  on  the  test  error. 


Part V
Appendices 


Appendix  A 

Differentiation  of  Matrix  Expressions 


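For univariate $x$ and $f: \mathbb{R} \to \mathbb{R}$, we write the derivative as $\frac{df}{dx}$.
We define

$$\frac{\partial}{\partial x} = \left[ \frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_p} \right]^T$$

to be differentiation with respect to elements of a vector $x = [x_1, \dots, x_p]^T$. Let $a$
and $x$ represent $p \times 1$ vectors, then

$$\frac{\partial}{\partial x}(a^T x) = a = \frac{\partial}{\partial x}(x^T a), \qquad (A.1)$$

the second equality arising because $a^T x = x^T a$. Also

$$\frac{\partial}{\partial x^T}(a^T x) = \left[ \frac{\partial}{\partial x}(a^T x) \right]^T = a^T = \frac{\partial}{\partial x^T}(x^T a). \qquad (A.2)$$

Suppose $u = u(x)$ is an $r \times 1$ vector and $x$ is $p \times 1$. Then

$$\frac{\partial u^T}{\partial x}$$

is a matrix of order $p \times r$ with $(i,j)$th element

$$\frac{\partial u_j}{\partial x_i}, \quad i = 1, \dots, p, \quad j = 1, \dots, r.$$

The transpose

$$\left( \frac{\partial u^T}{\partial x} \right)^T = \frac{\partial u}{\partial x^T}$$

is a matrix of order $r \times p$ with $(j,k)$th element

$$\frac{\partial u_j}{\partial x_k}, \quad j = 1, \dots, r, \quad k = 1, \dots, p.$$

For example,

$$\frac{\partial x}{\partial x^T} = \left[ \frac{\partial x}{\partial x_1}, \dots, \frac{\partial x}{\partial x_p} \right] = I_p,$$

the $p \times p$ identity matrix.

Consider the matrix $A$ of dimension $p \times p$. If $A$ is not a function of $x$:

$$\frac{\partial}{\partial x^T}(A x) = A \frac{\partial x}{\partial x^T} = A$$

and

$$\frac{\partial}{\partial x}(x^T A) = \frac{\partial x^T}{\partial x} A = A.$$

If $u = u(x)$ then

$$\frac{\partial}{\partial x^T}(A u) = A \frac{\partial u}{\partial x^T}$$

and

$$\frac{\partial}{\partial x}(u^T A) = \frac{\partial u^T}{\partial x} A.$$

Let $u = u(x)$ and $v = v(x)$ be $p \times 1$ vectors. Then the derivative of the inner
product $u^T v$ is

$$\frac{\partial}{\partial x}(u^T v) = \frac{\partial u^T}{\partial x} v + \frac{\partial v^T}{\partial x} u.$$

If $A$ is again a $p \times p$ matrix then

$$\frac{\partial}{\partial x}(u^T A u) = \frac{\partial u^T}{\partial x} A u + \frac{\partial u^T}{\partial x} A^T u.$$

If $A$ is symmetric

$$\frac{\partial}{\partial x}(u^T A u) = 2 \frac{\partial u^T}{\partial x} A u.$$

In particular, for a quadratic form

$$\frac{\partial}{\partial x}(x^T A x) = 2 A x. \qquad (A.3)$$

Let $f: \mathbb{R}^p \to \mathbb{R}$, then

$$\frac{\partial f}{\partial x}$$

is $p \times 1$ and

$$\frac{\partial}{\partial x} \frac{\partial f}{\partial x^T} = \frac{\partial^2 f}{\partial x \partial x^T}$$

is a $p \times p$ matrix with elements

$$\frac{\partial^2 f}{\partial x_i \partial x_j}, \quad i = 1, \dots, p, \quad j = 1, \dots, p.$$

For example, with $p = 2$:

$$
\frac{\partial}{\partial x} \frac{\partial f}{\partial x^T}
= \begin{bmatrix}
\frac{\partial}{\partial x_1} \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right) \\
\frac{\partial}{\partial x_2} \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right)
\end{bmatrix}
= \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2}
\end{bmatrix}.
$$

For a non-singular $p \times p$ matrix $A$, whose elements are functions of $x$, we have

$$\frac{\partial A^{-1}}{\partial x} = -A^{-1} \frac{\partial A}{\partial x} A^{-1}.$$

Also,

$$\frac{\partial}{\partial x} \log |A| = \mathrm{tr}\left( A^{-1} \frac{\partial A}{\partial x} \right).$$

The trace of a $p \times p$ square matrix is

$$\mathrm{tr}(A) = \sum_{i=1}^{p} a_{ii}.$$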

Appendix  B 

Matrix  Results 


We begin with two properties of determinants:

$$\det(A^T A) = \det(A^T) \det(A) = \det(A)^2 \qquad (B.1)$$

and

$$\begin{vmatrix} T & U \\ V & W \end{vmatrix} = |T| \, |W - V T^{-1} U|. \qquad (B.2)$$

Let $A$ be an $n \times n$ non-singular matrix, which we express as

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where $A_{11}$ is $k \times k$ and $A_{12}$ is $k \times (n - k)$. The inverse $B = A^{-1}$ has elements

$$
\begin{aligned}
B_{11} &= (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} \\
B_{22} &= (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} \\
B_{12} &= -A_{11}^{-1} A_{12} B_{22} \\
B_{21} &= -A_{22}^{-1} A_{21} B_{11}.
\end{aligned}
$$

For matrices $A$, $B$ and $C$ of the appropriate dimensions:

$$(A + B C B^T)^{-1} = A^{-1} - A^{-1} B (B^T A^{-1} B + C^{-1})^{-1} B^T A^{-1}. \qquad (B.3)$$

We now describe how the expectation, variance and covariance operators deal with
vectors of random variables.

Suppose $U$ is an $n \times 1$ vector of random variables, and $A$ is an $m \times n$ matrix.
Then

$$
\begin{aligned}
E[AU] &= A \, E[U] \\
\mathrm{var}(AU) &= A \, \mathrm{var}(U) A^T.
\end{aligned}
$$

Suppose $V$ is an $m \times 1$ vector of random variables. Then $\mathrm{cov}(U, V) = C$ is an
$n \times m$ matrix with $(i,j)$th element $\mathrm{cov}(U_i, V_j)$, $i = 1, \dots, n$, $j = 1, \dots, m$. Hence,
$\mathrm{cov}(V, U) = C^T$. In addition,

$$
\begin{aligned}
\mathrm{cov}(U, AU) &= \mathrm{cov}(U) A^T \\
\mathrm{cov}(AU, U) &= A \, \mathrm{cov}(U).
\end{aligned}
$$

The iterated expectation and covariance formulas are given by:

$$
\begin{aligned}
E[Y] &= E_X[E_{Y|X}(Y \mid X)] \\
\mathrm{cov}(Y, Z) &= E_X[\mathrm{cov}_{Y,Z|X}(Y, Z \mid X)] + \mathrm{cov}_X[E_{Y|X}(Y \mid X), E_{Z|X}(Z \mid X)].
\end{aligned}
$$

Suppose $Z$ is an $n \times 1$ random variable with $E[Z] = \mu$, $\mathrm{var}(Z) = \Sigma$ and $A$ is
a symmetric $n \times n$ matrix. Then

$$E[Z^T A Z] = \mathrm{tr}(A \Sigma) + \mu^T A \mu. \qquad (B.4)$$

See Schott (1997, p. 391) for a proof.
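
The quadratic form result (B.4) can be verified exactly on a small discrete distribution, since it holds for any distribution with the stated mean and variance. The sketch below assumes a two-dimensional Z that is equally likely to take each of four illustrative support points; the values of `points` and `A` are arbitrary choices made here.

```python
def quad_form(z, A):
    """Compute z^T A z for a vector z and square matrix A given as nested lists."""
    n = len(z)
    return sum(z[i] * A[i][j] * z[j] for i in range(n) for j in range(n))


# Equally likely support points of a toy 2-dimensional Z (illustrative values).
points = [(0.0, 1.0), (2.0, -1.0), (1.0, 3.0), (-1.0, 0.0)]
A = [[2.0, 0.5], [0.5, 1.0]]  # symmetric
n, N = 2, len(points)

# Exact mean vector and variance-covariance matrix of Z.
mu = [sum(z[i] for z in points) / N for i in range(n)]
Sigma = [[sum((z[i] - mu[i]) * (z[j] - mu[j]) for z in points) / N
          for j in range(n)] for i in range(n)]

lhs = sum(quad_form(z, A) for z in points) / N                       # E[Z^T A Z]
trace_term = sum(A[i][j] * Sigma[j][i] for i in range(n) for j in range(n))
rhs = trace_term + quad_form(mu, A)                                  # tr(A Sigma) + mu^T A mu
print(lhs, rhs)
```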


Appendix  C 

Some  Linear  Algebra 


Bases 

Definition. Let $S$ be a collection of $m \times 1$ vectors satisfying the following:

1. If $x_1 \in S$ and $x_2 \in S$, then $x_1 + x_2 \in S$.

2. If $x \in S$, and $a$ is any real scalar, then $ax \in S$.

Then $S$ is called a vector space in $m$-dimensional space.

Definition. Let $\{x_1, \dots, x_n\}$ be a set of $m \times 1$ vectors in the vector space $S$. If
each vector in $S$ can be expressed as a linear combination of $x_1, \dots, x_n$ then the
set $\{x_1, \dots, x_n\}$ is said to span $S$.

Definition. Let $\{x_1, \dots, x_n\}$ be a set of $m \times 1$ vectors in the vector space $S$.
This set is called a basis if it spans $S$ and if the vectors $x_1, \dots, x_n$ are linearly
independent.


Appendix D

Probability Distributions and Generating Functions


Continuous Distributions

Multivariate Normal Distribution

The $p$-dimensional random variable $X = [X_1, \dots, X_p]^T$ has a normal distribution,
denoted $N_p(\mu, \Sigma)$, with mean $\mu = [\mu_1, \dots, \mu_p]^T$ and $p \times p$ variance-covariance
matrix $\Sigma$ if its density is of the form

$$p(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\},$$

for $x \in \mathbb{R}^p$, $\mu \in \mathbb{R}^p$ and non-singular $\Sigma$.

Summaries:

$$
\begin{aligned}
E[X] &= \mu \\
\mathrm{mode}(X) &= \mu \\
\mathrm{var}(X) &= \Sigma.
\end{aligned}
$$

Suppose

$$
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N_p\left(
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}
\right)
$$

where:

• $X_1$ and $\mu_1$ are $r \times 1$,

• $X_2$ and $\mu_2$ are $(p - r) \times 1$,

• $V_{11}$ is $r \times r$,

• $V_{12}$ is $r \times (p - r)$, $V_{21}$ is $(p - r) \times r$,

• $V_{22}$ is $(p - r) \times (p - r)$.

Then the marginal distribution of $X_1$ is

$$X_1 \sim N_r(\mu_1, V_{11})$$

and the conditional distribution $X_1 \mid X_2 = x_2$ is

$$X_1 \mid X_2 = x_2 \sim N_r\left[ \mu_1 + V_{12} V_{22}^{-1} (x_2 - \mu_2), W_{11} \right], \qquad (D.1)$$

where $W_{11} = V_{11} - V_{12} V_{22}^{-1} V_{21}$.
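
For the bivariate case ($p = 2$, $r = 1$) all blocks in (D.1) are scalars and $V_{21} = V_{12}$, so the conditional mean and variance can be evaluated directly. The function below is a small sketch of that special case; its name and the numerical inputs are illustrative assumptions.

```python
def conditional_normal(mu1, mu2, v11, v12, v22, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate normal (scalar blocks)."""
    mean = mu1 + v12 / v22 * (x2 - mu2)   # mu1 + V12 V22^{-1} (x2 - mu2)
    var = v11 - v12 * v12 / v22           # V11 - V12 V22^{-1} V21, with V21 = V12
    return mean, var


# Standardized margins with correlation 0.5, conditioning on X2 = 2.
cond_mean, cond_var = conditional_normal(0.0, 0.0, 1.0, 0.5, 1.0, 2.0)
print(cond_mean, cond_var)
```

Conditioning shifts the mean toward the observed value in proportion to the covariance and always reduces the variance relative to the marginal value $V_{11}$ unless $V_{12} = 0$.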

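Suppose

$$Y_j \mid \mu_j, \sigma_j^2 \sim N(\mu_j, \sigma_j^2)$$

for $j = 1, \dots, J$, with $Y_1, \dots, Y_J$ independent. Then, if $a_1, \dots, a_J$ represent
constants,

$$\sum_{j=1}^{J} a_j Y_j \sim N\left( \sum_{j=1}^{J} a_j \mu_j, \; \sum_{j=1}^{J} a_j^2 \sigma_j^2 \right). \qquad (D.2)$$

If $Y$ is a $p \times 1$ vector of random variables whose distribution is $N_p(\mu, \Sigma)$ and $A$ is
an $r \times p$ matrix of constants, then

$$AY \sim N_r(A\mu, A \Sigma A^T). \qquad (D.3)$$


Beta Distribution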


The random variable $X$ follows a beta distribution, denoted $\mathrm{Be}(a, b)$, if its density
has the form:

$$p(x) = B(a, b)^{-1} x^{a-1} (1 - x)^{b-1},$$

for $0 < x < 1$ and $a, b > 0$ and where

$$B(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a + b)} = \int_0^1 z^{a-1} (1 - z)^{b-1} \, dz \qquad (D.4)$$

is the beta function.

Summaries:

$$
\begin{aligned}
E[X] &= \frac{a}{a + b} \\
\mathrm{mode}(X) &= \frac{a - 1}{a + b - 2} \quad \text{for } a, b > 1 \\
\mathrm{var}(X) &= \frac{ab}{(a + b)^2 (a + b + 1)}.
\end{aligned}
$$


Gamma Distribution


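The random variable $X$ follows a gamma distribution, denoted $\mathrm{Ga}(a, b)$, if its
density is of the form

$$p(x) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-bx),$$

for $x > 0$ and $a, b > 0$.

Summaries:

$$
\begin{aligned}
E[X] &= \frac{a}{b} \\
\mathrm{mode}(X) &= \frac{a - 1}{b} \quad \text{for } a > 1 \\
\mathrm{var}(X) &= \frac{a}{b^2}.
\end{aligned}
$$

A $\chi^2_k$ random variable with degrees of freedom $k$ corresponds to the $\mathrm{Ga}(k/2, 1/2)$
distribution.


Inverse Gamma Distribution

The random variable $X$ follows an inverse gamma distribution, denoted $\mathrm{InvGa}(a, b)$,
if its density is of the form

$$p(x) = \frac{b^a}{\Gamma(a)} x^{-(a+1)} \exp(-b/x),$$

for $x > 0$ and $a, b > 0$.

Summaries:

$$
\begin{aligned}
E[X] &= \frac{b}{a - 1} \quad \text{for } a > 1 \\
\mathrm{mode}(X) &= \frac{b}{a + 1} \\
\mathrm{var}(X) &= \frac{b^2}{(a - 1)^2 (a - 2)} \quad \text{for } a > 2.
\end{aligned}
$$

If $Y$ is $\mathrm{Ga}(a, b)$ then $X = Y^{-1}$ is $\mathrm{InvGa}(a, b)$.


Lognormal Distribution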


The  random  variable  X  follows  a  (univariate)  lognormal  distribution,  denoted 
LogNorm(/r,  cr2),  if  its  density  is  of  the  form 

V)2  - 

for  x  >  0  and  p  €  K,  a  >  0. 

Summaries: 


p(x)  =  (27tct2)  1/2—  exp 
x 


E[X]  =  exp(/r  +  cr2/2) 
mode(X)  =  exp(/T  —  cr2) 

var(X)  =  E[X]2  [exp(a2)  -  l]  . 

If  Y  is  N(p,  cr2)  then  X  =  exp(F)  is  LogNorm(/i,  a2). 


Laplacian  Distribution 


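The random variable $X$ follows a Laplacian distribution, denoted Lap$(\mu, \phi)$, if its density is of the form

$$p(x) = \frac{1}{2\phi} \exp(-|x - \mu| / \phi),$$

for $x \in \mathbb{R}$, $\mu \in \mathbb{R}$ and $\phi > 0$.

Summaries:

$$
\begin{aligned}
E[X] &= \mu \\
\text{mode}(X) &= \mu \\
\text{var}(X) &= 2\phi^2.
\end{aligned}
$$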


Multivariate  t  Distribution 

The $p$-dimensional random variable $X = [X_1, \ldots, X_p]^T$ has a (Student's) $t$ distribution with $d$ degrees of freedom, location $\mu = [\mu_1, \ldots, \mu_p]^T$ and $p \times p$ scale matrix $\Sigma$, denoted $T_p(\mu, \Sigma, d)$, if its density is of the form

$$p(x) = \frac{\Gamma[(d + p)/2]}{\Gamma(d/2)(d\pi)^{p/2}} |\Sigma|^{-1/2} \left[ 1 + \frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{d} \right]^{-(d+p)/2},$$

for $x \in \mathbb{R}^p$, $\mu \in \mathbb{R}^p$, non-singular $\Sigma$ and $d > 0$.

Summaries:

$$
\begin{aligned}
E[X] &= \mu \quad \text{for } d > 1 \\
\text{mode}(X) &= \mu \\
\text{var}(X) &= \frac{d}{d - 2} \Sigma \quad \text{for } d > 2.
\end{aligned}
$$

The margins of a multivariate $t$ distribution also follow $t$ distributions. For example, if $X = [X_1, X_2]^T$ where $X_1$ is $r \times 1$ and $X_2$ is $(p - r) \times 1$, then the marginal distribution is

$$X_1 \sim T_r(\mu_1, V_{11}, d),$$

where $\mu_1$ is $r \times 1$ and $V_{11}$ is $r \times r$.


F  Distribution 

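The random variable $X$ follows an $F$ distribution, denoted F$(a, b)$, if its density is of the form

$$p(x) = \frac{a^{a/2} b^{b/2}}{B(a/2, b/2)} \frac{x^{a/2 - 1}}{(b + ax)^{(a+b)/2}},$$

for $x > 0$, with degrees of freedom $a, b > 0$ and where $B(\cdot, \cdot)$ is the beta function, as defined in (D.4).

Summaries:

$$
\begin{aligned}
E[X] &= \frac{b}{b - 2} \quad \text{for } b > 2 \\
\text{mode}(X) &= \frac{a - 2}{a} \, \frac{b}{b + 2} \quad \text{for } a > 2 \\
\text{var}(X) &= \frac{2b^2 (a + b - 2)}{a (b - 2)^2 (b - 4)} \quad \text{for } b > 4.
\end{aligned}
$$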
Wishart Distribution

The $p \times p$ random matrix $X$ follows a Wishart distribution, denoted Wish$_p(r, S)$, if its probability density function is of the form

$$p(x) = \frac{|x|^{(r - p - 1)/2}}{2^{rp/2} \, \Gamma_p(r/2) \, |S|^{r/2}} \exp\left[ -\frac{1}{2} \text{tr}(x S^{-1}) \right],$$

for $x$ positive definite, $S$ positive definite and $r > p - 1$ and where

$$\Gamma_p(r/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma[(r + 1 - j)/2]$$

is the generalized gamma function.

Summaries:

$$
\begin{aligned}
E[X] &= rS \\
\text{mode}(X) &= (r - p - 1) S \quad \text{for } r > p + 1 \\
\text{var}(X_{ij}) &= r (S_{ij}^2 + S_{ii} S_{jj}) \quad \text{for } i, j = 1, \ldots, p.
\end{aligned}
$$

Marginally, the diagonal elements $X_{ii}$ have distribution Ga$[r/2, 1/(2S_{ii})]$, $i = 1, \ldots, p$.

Taking $p = 1$ yields

$$p(x) = \frac{(2S)^{-r/2}}{\Gamma(r/2)} x^{r/2 - 1} \exp[-x/(2S)],$$

for $x > 0$ and $S, r > 0$, i.e. a Ga$[r/2, 1/(2S)]$ distribution, revealing that the Wishart distribution is a multivariate version of the gamma distribution.


Inverse  Wishart  Distribution 


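The $p \times p$ random matrix $X$ follows an inverse Wishart distribution, denoted InvWish$_p(r, S)$, if its probability density function is of the form

$$p(x) = \frac{|x|^{-(r + p + 1)/2}}{2^{rp/2} \, \Gamma_p(r/2) \, |S|^{r/2}} \exp\left[ -\frac{1}{2} \text{tr}(x^{-1} S^{-1}) \right],$$

for $x$ positive definite, $S$ positive definite and $r > p - 1$.

Summaries:

$$
\begin{aligned}
E[X] &= \frac{S^{-1}}{r - p - 1} \quad \text{for } r > p + 1 \\
\text{mode}(X) &= \frac{S^{-1}}{r + p + 1} \\
\text{var}(X_{ij}) &= \frac{(r - p + 1) [(S^{-1})_{ij}]^2 + (r - p - 1) (S^{-1})_{ii} (S^{-1})_{jj}}{(r - p)(r - p - 1)^2 (r - p - 3)} \quad \text{for } i, j = 1, \ldots, p.
\end{aligned}
$$

If $p = 1$ we recover the inverse gamma distribution InvGa$[r/2, 1/(2S)]$ with

$$
\begin{aligned}
E[X] &= \frac{1}{S(r - 2)} \quad \text{for } r > 2 \\
\text{mode}(X) &= \frac{1}{S(r + 2)} \\
\text{var}(X) &= \frac{2}{S^2 (r - 2)^2 (r - 4)} \quad \text{for } r > 4.
\end{aligned}
$$

If $Y \sim$ Wish$_p(r, S)$, the distribution of $X = Y^{-1}$ is InvWish$_p(r, S)$.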


Discrete  Distributions 
Binomial  Distribution 

The  random  variable  X  has  a  binomial  distribution,  denoted  Binomial(n,p),  if  its 
distribution  is  of  the  form 

$$\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x},$$

for $x = 0, 1, \ldots, n$ and $0 < p < 1$.

Summaries:

$$
\begin{aligned}
E[X] &= np \\
\text{var}(X) &= np(1 - p).
\end{aligned}
$$
Poisson Distribution

The random variable $X$ has a Poisson distribution, denoted Poisson$(\mu)$, if its distribution is of the form

$$\Pr(X = x) = \frac{\exp(-\mu) \mu^x}{x!},$$

for $\mu > 0$ and $x = 0, 1, 2, \ldots$.

Summaries:

$$
\begin{aligned}
E[X] &= \mu \\
\text{var}(X) &= \mu.
\end{aligned}
$$


Negative  Binomial  Distribution 

The random variable $X$ has a negative binomial distribution, denoted NegBin$(\mu, b)$, if its distribution is of the form

$$\Pr(X = x) = \frac{\Gamma(x + b)}{\Gamma(x + 1)\Gamma(b)} \left( \frac{\mu}{\mu + b} \right)^x \left( \frac{b}{\mu + b} \right)^b,$$

for $\mu > 0$, $b > 0$ and $x = 0, 1, 2, \ldots$

Summaries:

$$
\begin{aligned}
E[X] &= \mu \\
\text{var}(X) &= \mu + \mu^2 / b.
\end{aligned}
$$

The negative binomial distribution arises as a gamma mixture of a Poisson random variable. Specifically, if $X \mid \mu, \delta \sim$ Poisson$(\mu\delta)$ and $\delta \mid b \sim$ Ga$(b, b)$, then $X \mid \mu, b \sim$ NegBin$(\mu, b)$.

We link the above description, motivated by a random effects argument, with the more familiar derivation in which the negative binomial arises as the number of successes seen before we observe $b$ failures from independent trials, each with success probability $p = \mu/(\mu + b)$. The probability distribution is

$$\Pr(X = x) = \binom{x + b - 1}{x} p^x (1 - p)^b,$$

for $0 < p < 1$, $b > 0$ an integer, and $x = 0, 1, 2, \ldots$

Summaries:

$$
\begin{aligned}
E[X] &= \frac{pb}{1 - p} \\
\text{var}(X) &= \frac{pb}{(1 - p)^2}.
\end{aligned}
$$
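The two forms of the negative binomial probability function can be compared directly when $b$ is an integer and $p = \mu/(\mu + b)$. The Python sketch below does this and also recovers the mean and variance by summing over a truncated support. The function names, the values $\mu = 2.5$, $b = 4$, and the truncation at 200 terms are illustrative assumptions.

```python
import math


def negbin_pmf_mean(x, mu, b):
    """NegBin(mu, b) probability in the mean parameterization (b > 0, any real)."""
    log_prob = (
        math.lgamma(x + b) - math.lgamma(x + 1) - math.lgamma(b)
        + x * math.log(mu / (mu + b))
        + b * math.log(b / (mu + b))
    )
    return math.exp(log_prob)


def negbin_pmf_trials(x, p, b):
    """Independent-trials form; b must be a positive integer."""
    return math.comb(x + b - 1, x) * p ** x * (1.0 - p) ** b


mu, b = 2.5, 4            # hypothetical values
p = mu / (mu + b)         # link between the two parameterizations

max_diff = max(
    abs(negbin_pmf_mean(x, mu, b) - negbin_pmf_trials(x, p, b)) for x in range(30)
)

# Moments from a truncated sum; the tail beyond 200 is negligible here.
probs = [negbin_pmf_trials(x, p, b) for x in range(200)]
mean = sum(x * q for x, q in enumerate(probs))
var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))
print(max_diff, mean, var)  # mean near mu, var near mu + mu^2 / b
```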


Generating  Functions 

The moment generating function of a random variable $Y$ is defined as $M_Y(t) = E[\exp(tY)]$, for $t \in \mathbb{R}$, whenever this expectation exists. We state three important and useful properties of moment generating functions:

1. If two distributions have the same moment generating functions then they are identical at almost all points.

2. Using a series expansion:

$$\exp(tY) = 1 + tY + \frac{t^2 Y^2}{2!} + \frac{t^3 Y^3}{3!} + \ldots$$

so that

$$M_Y(t) = 1 + t m_1 + \frac{t^2 m_2}{2!} + \frac{t^3 m_3}{3!} + \ldots$$

where $m_i$ is the $i$th moment. Hence,

$$E[Y^i] = M_Y^{(i)}(0) = \left. \frac{d^i M_Y}{dt^i} \right|_{t=0}.$$

3. If $Y_1, \ldots, Y_n$ are a sequence of independent random variables and $S = \sum_{i=1}^{n} a_i Y_i$, with $a_i$ constant, then the moment generating function of $S$ is

$$M_S(t) = \prod_{i=1}^{n} M_{Y_i}(a_i t).$$

The cumulant generating function of a random variable $Y$ is defined as

$$C_Y(t) = \log E[\exp(tY)]$$

for $t \in \mathbb{R}$.
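Properties 2 and 3 can be illustrated with the Poisson distribution, whose moment generating function is $M_Y(t) = \exp\{\mu(e^t - 1)\}$ (a standard result, not stated above). The Python sketch below approximates the first two moments by finite differences of $M_Y$ at $t = 0$ and checks that the product of two Poisson moment generating functions is again of Poisson form. The function names, step size and parameter values are assumptions made for illustration.

```python
import math


def poisson_mgf(t, mu):
    """M_Y(t) = exp{mu (e^t - 1)} for Y ~ Poisson(mu)."""
    return math.exp(mu * (math.exp(t) - 1.0))


def first_two_moments(mgf, h=1e-4):
    """Approximate E[Y] and E[Y^2] by central differences of the mgf at t = 0."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / h ** 2
    return m1, m2


mu = 3.0  # hypothetical value
m1, m2 = first_two_moments(lambda t: poisson_mgf(t, mu))
variance = m2 - m1 ** 2
print(m1, m2, variance)  # approximately mu, mu + mu^2 and mu

# Property 3 with a_1 = a_2 = 1: the mgf of Y_1 + Y_2 is the product of the mgfs,
# which here equals the mgf of a Poisson(1.5 + 2.0) random variable.
t = 0.3
product = poisson_mgf(t, 1.5) * poisson_mgf(t, 2.0)
direct = poisson_mgf(t, 3.5)
print(product, direct)
```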


Appendix  E 

Functions  of  Normal  Random  Variables 


If $Y_j \sim N(0, 1)$, $j = 1, \ldots, J$, with $Y_1, \ldots, Y_J$ independent, then

$$Z = \sum_{j=1}^{J} Y_j^2 \sim \chi^2_J, \tag{E.1}$$

a chi-squared distribution with $J$ degrees of freedom and $E[Z] = J$, $\text{var}(Z) = 2J$.

If $X \sim N(0, 1)$, $Y \sim \chi^2_d$, with $X$ and $Y$ independent, then

$$\frac{X}{(Y/d)^{1/2}} \sim T(0, 1, d), \tag{E.2}$$

a Student's $t$ distribution with $d$ degrees of freedom.

If $U \sim \chi^2_J$ and $V \sim \chi^2_K$, with $U$ and $V$ independent, then

$$\frac{U/J}{V/K} \sim F(J, K), \tag{E.3}$$

the $F$ distribution with $J$, $K$ degrees of freedom.
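The moments quoted with (E.1) can be checked by simulation. The following Python sketch draws sums of $J$ squared standard normal variables and compares the sample mean and variance with $J$ and $2J$. The function name, the seed, $J = 5$ and the number of draws are assumptions for illustration, and the agreement is only up to Monte Carlo error.

```python
import random
import statistics


def simulate_chi_squared(J, n_draws, seed=0):
    """Draws of Z = sum of J squared independent N(0, 1) variables, as in (E.1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(J)) for _ in range(n_draws)]


draws = simulate_chi_squared(J=5, n_draws=20_000)
sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print(sample_mean, sample_var)  # close to J = 5 and 2J = 10
```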
Appendix F

Some  Results  from  Classical  Statistics 


In  this  section  we  provide  some  definitions  and  state  some  theorems  (without 
proof)  from  classical  statistics.  More  details  can  be  found  in  Schervish  (1995).  Let 
$y = [y_1, \ldots, y_n]^T$ be a random sample from $p(y \mid \theta)$.


Definition. The statistic $T(Y)$ is sufficient for $\theta$ within a family of probability distributions $p(y \mid \theta)$ if $p(y \mid T(y))$ does not depend upon $\theta$.

Theorem. The Fisher-Neyman factorization theorem states that $T(Y)$ is sufficient for $\theta$ if and only if

$$p(y \mid \theta) = g[T(y) \mid \theta] \times h(y).$$

Intuitively, all of the information in the sample with respect to $\theta$ is contained in $T(Y)$.

Definition. The statistic $T(Y)$ is minimal sufficient for $\theta$ within a family of probability distributions $p(y \mid \theta)$ if no further reduction from $T$ is possible while retaining sufficiency.

Theorem. The Lehmann-Scheffé theorem states that if $T(Y)$ satisfies the following property: for every pair of sample points $y$, $z$ the ratio $p(y \mid \theta)/p(z \mid \theta)$ is free of $\theta$ if and only if $T(y) = T(z)$, then $T$ is minimal sufficient.

Example. Let $Y_1, \ldots, Y_n$ be independent and identically distributed from the one-parameter exponential family of distributions:

$$p(y \mid \theta) = \exp[\theta T(y) - b(\theta) + c(y)]$$

for functions $b(\cdot)$ and $c(\cdot)$. Then $\sum_{i=1}^{n} T(Y_i)$ is sufficient for $\theta$ by the factorization theorem and minimal sufficient by the Lehmann-Scheffé theorem.
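The Lehmann-Scheffé criterion in the example above can be illustrated with a Poisson sample written in exponential family form, taking $T(y) = y$, $b(\theta) = e^\theta$ and $c(y) = -\log y!$ (so that $\theta$ is the log of the Poisson mean). The Python sketch below evaluates the log of the ratio $p(y \mid \theta)/p(z \mid \theta)$ over several values of $\theta$: it is constant when the two samples share the same value of $\sum_i T(y_i)$ and varies otherwise. The samples and the grid of $\theta$ values are hypothetical.

```python
import math


def log_density(sample, theta):
    """Log joint density of an i.i.d. sample: sum of theta*T(y) - b(theta) + c(y),
    with T(y) = y, b(theta) = exp(theta) and c(y) = -log(y!)."""
    return sum(theta * y - math.exp(theta) - math.lgamma(y + 1) for y in sample)


def log_ratio(y, z, theta):
    """log{ p(y | theta) / p(z | theta) } for two samples of the same size."""
    return log_density(y, theta) - log_density(z, theta)


y = [0, 2, 4]   # sum of T is 6
z = [1, 2, 3]   # sum of T is 6, same as y
w = [1, 1, 1]   # sum of T is 3, differs from y
thetas = [-1.0, 0.0, 0.7, 1.5]

same_sum = [log_ratio(y, z, th) for th in thetas]
diff_sum = [log_ratio(y, w, th) for th in thetas]
print(same_sum)  # constant in theta
print(diff_sum)  # changes with theta
```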

Definition. A statistic $V = V(Y)$ is ancillary for $\theta$ within a family of probability distributions $p(y \mid \theta)$ if its distribution does not depend on $\theta$.

Definition. If a minimal sufficient statistic is $T = [T_1, T_2]$ and $T_2$ is ancillary then $T_1$ is called conditionally sufficient given $T_2$.

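Example. In a normal linear regression with covariate $x$, suppose that $x$ has distribution $p(x)$ and

$$Y_i \mid X_i = x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \quad i = 1, \ldots, n.$$

Then, letting $x = [x_1, \ldots, x_n]^T$, $y = [y_1, \ldots, y_n]^T$ and $\beta = [\beta_0, \beta_1]^T$, the distribution for the data is

$$p(x, y \mid \beta, \sigma^2) = p(x) (2\pi\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \right].$$

The sufficient statistic for $[\beta, \sigma^2]$ is

$$S = \left[ \sum_{i=1}^{n} y_i, \; \sum_{i=1}^{n} y_i^2, \; \sum_{i=1}^{n} x_i y_i, \; \sum_{i=1}^{n} x_i, \; \sum_{i=1}^{n} x_i^2 \right],$$

with the last two components being an ancillary statistic.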

Definition. A statistic $T$ is complete if for every real-valued function $g(\cdot)$, $E[g(T)] = 0$ for every $\theta$ implies $g(T) = 0$.

Definition. Suppose we wish to estimate $\phi = \phi(\theta)$ based on $Y \mid \theta \sim p(\cdot \mid \theta)$. An unbiased estimator $\hat{\phi}$ of $\phi$ is a uniformly minimum-variance unbiased estimator (UMVUE) if, for all other unbiased estimators $\tilde{\phi}$,

$$\text{var}(\hat{\phi}) \leq \text{var}(\tilde{\phi})$$

for all $\theta$.

Lemma. If $T$ is complete then $\phi(\theta)$ admits at most one unbiased estimator $\hat{\phi}(T)$ depending on $T$.

Theorem (Rao-Blackwell-Lehmann-Scheffé). Let $T = T(Y)$ be complete and sufficient for $\theta$. If there exists at least one unbiased estimator $\tilde{\phi} = \tilde{\phi}(Y)$ for $\phi(\theta)$ then there exists a unique UMVUE $\hat{\phi} = \hat{\phi}(T)$ for $\phi(\theta)$, namely,

$$\hat{\phi}(T) = E[\tilde{\phi}(Y) \mid T].$$

Corollary. Let $T = T(Y)$ be complete and sufficient for $\theta$. Then any function $g(T)$ is the UMVUE of its expectation $E[g(T)] = \phi(\theta)$.

Theorem. The Cramér-Rao lower bound for any unbiased estimator $\hat{\phi}$ of a scalar function of interest $\phi = \phi(\theta)$ is

$$\text{var}(\hat{\phi}) \geq -\frac{[\phi'(\theta)]^2}{E\left[ \partial^2 l / \partial \theta^2 \right]},$$

where $l(\theta) = \sum_{i=1}^{n} \log p(y_i \mid \theta)$ is the log of the joint distribution, viewed as a function of $\theta$. Equality holds if and only if $p(y \mid \theta)$ is a one-parameter exponential family.


Appendix  G 

Basic  Large  Sample  Theory 


We  define  various  quantities,  and  state  results,  that  are  useful  in  various  places  in 
the book. The presentation is informal; see, for example, van der Vaart (1998) for
more  rigour. 


Modes  of  Convergence 

Suppose that $Y_n$, $n \geq 1$, are all random variables defined on a probability space $(\Omega, \mathcal{A}, P)$ where $\Omega$ is a set (the sample space), $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$, and $P$ is a probability measure.

Definition. We say that $Y_n$ converges almost surely to $Y$, denoted $Y_n \to_{a.s.} Y$, if

$$Y_n(\omega) \to Y(\omega) \text{ for all } \omega \in A \text{ where } P(A^c) = 0 \tag{G.1}$$

or, equivalently, if, for every $\epsilon > 0$,

$$P\left( \sup_{m \geq n} |Y_m - Y| > \epsilon \right) \to 0 \text{ as } n \to \infty. \tag{G.2}$$

Definition. We say that $Y_n$ converges in probability to $Y$, denoted $Y_n \to_p Y$, if, for every $\epsilon > 0$,

$$P(|Y_n - Y| > \epsilon) \to 0 \text{ as } n \to \infty. \tag{G.3}$$

Definition. Define the distribution function of $Y$ as $F(y) = \Pr(Y \leq y)$. We say that $Y_n$ converges in distribution to $Y$, denoted $Y_n \to_d Y$, or $F_n \to F$, if

$$F_n(y) \to F(y) \text{ as } n \to \infty \text{ for each continuity point } y \text{ of } F. \tag{G.4}$$


Limit  Theorems 

Proposition (Weak Law of Large Numbers). If $Y_1, Y_2, \ldots, Y_n, \ldots$ are independent and identically distributed (i.i.d.) with mean $\mu = E[Y]$ (so $E[|Y|] < \infty$) then $\bar{Y}_n \to_p \mu$.

Proposition (Strong Law of Large Numbers). If $Y_1, Y_2, \ldots, Y_n, \ldots$ are i.i.d. with mean $\mu = E[Y]$ (so $E[|Y|] < \infty$) then $\bar{Y}_n \to_{a.s.} \mu$.

Proposition (Central Limit Theorem). If $Y_1, Y_2, \ldots, Y_n$ are i.i.d. with mean $\mu = E[Y]$ and variance $\sigma^2$ (so $E[Y^2] < \infty$), then $\sqrt{n}(\bar{Y}_n - \mu) \to_d N(0, \sigma^2)$.

Proposition (Slutsky's Theorem). Suppose that $A_n \to_p a$, $B_n \to_p b$, for constants $a$ and $b$, and $Y_n \to_d Y$. Then $A_n Y_n + B_n \to_d aY + b$.

Proposition (Delta Method). Suppose $\sqrt{n}(Y_n - \mu) \to_d Z$ and suppose that $g : \mathbb{R}^p \to \mathbb{R}^k$ has a derivative $g'$ at $\mu$ (here $g'$ is a $k \times p$ matrix of derivatives).

Then the delta method gives the asymptotic distribution as

$$\sqrt{n}\,[g(Y_n) - g(\mu)] \to_d g'(\mu) Z.$$

If $Z \sim N_p(0, \Sigma)$, then

$$\sqrt{n}\,[g(Y_n) - g(\mu)] \to_d N_k\left[ 0, \; g'(\mu) \Sigma g'(\mu)^T \right].$$
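The asymptotic covariance $g'(\mu) \Sigma g'(\mu)^T$ in the delta method can be computed mechanically once the derivative matrix is available. The Python sketch below approximates $g'(\mu)$ by central differences and applies it to a ratio $g(y_1, y_2) = y_1 / y_2$. The function names, the finite-difference step and the values of $\mu$ and $\Sigma$ are assumptions chosen for illustration.

```python
def numerical_jacobian(g, mu, h=1e-6):
    """Central-difference approximation to the k x p derivative matrix g'(mu)."""
    p = len(mu)
    k = len(g(mu))
    jac = [[0.0] * p for _ in range(k)]
    for j in range(p):
        up, down = list(mu), list(mu)
        up[j] += h
        down[j] -= h
        g_up, g_down = g(up), g(down)
        for i in range(k):
            jac[i][j] = (g_up[i] - g_down[i]) / (2.0 * h)
    return jac


def delta_method_cov(g, mu, Sigma):
    """Asymptotic covariance g'(mu) Sigma g'(mu)^T from the delta method."""
    jac = numerical_jacobian(g, mu)
    p, k = len(mu), len(jac)
    js = [[sum(jac[i][l] * Sigma[l][j] for l in range(p)) for j in range(p)]
          for i in range(k)]
    return [[sum(js[i][l] * jac[j][l] for l in range(p)) for j in range(k)]
            for i in range(k)]


def ratio(y):
    """g(y1, y2) = y1 / y2, returned as a length-1 list so that k = 1."""
    return [y[0] / y[1]]


mu = [2.0, 4.0]                       # hypothetical limit mean
Sigma = [[1.0, 0.5], [0.5, 2.0]]      # hypothetical covariance of Z
cov = delta_method_cov(ratio, mu, Sigma)
# Analytically g'(mu) = [1/4, -1/8], giving 1/16 + 2*(1/4)*(-1/8)*(1/2) + 2/64 = 0.0625.
print(cov)
```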
grouped, 525
Least  squares 

generalized,  214,  249,  289,  358 
ordinary,  214 
weighted,  214 

Lehmann-Scheffe  theorem,  669 
Leverage,  242 

linear  smoother,  532 
Likelihood,  36-49 

comparison  with  quasi-likelihood,  59 
conditional,  4, 44,  327-330 
definition  of,  36 
marginal,  44-45 

marginal, normal example, 45-46
maximum  likelihood  estimation,  36 
profile,  46 

Likelihood principle, 143
Likelihood  ratio  test,  75-76,  154 
GLMs,  267-269 
REML,  371 

variance  components,  368 
Limit  theorems,  673-674 
Lindley  paradox,  161-164 
Linear  Bayes  method,  144 
Linear  mixed  models,  358-390 

assessment  of  assumptions,  400-407 
Bayesian  computation,  385-388 
Bayesian  inference,  381-390 
covariance  models,  360-363 
likelihood  inference,  364-381 
likelihood  inference  for  fixed  effects, 
365-367 

likelihood  inference  for  random  effects, 
376-381 

likelihood  inference  for  variance 
components,  367-371 
parameter  interpretation,  363-364 
Linear  models 

assessment  of  assumptions,  239-245 
Bayesian  inference,  221-224 
justification,  198 


least  squares  estimation,  213-215 
likelihood  estimation,  209-212 
likelihood  inference,  209-213 
likelihood  testing,  212-213 
robustness,  236-239 
robustness  to  correlated  errors,  238-239 
robustness  to  distribution  of  errors,  237 
robustness  to  non-constant  variance, 
237-238 

solution  locus,  291 
Linear  predictor,  257 
Linear  smoother,  520,  527,  532,  552,  558 
inference,  560-563 
local  polynomial  regression,  582 
variance  estimation,  584 
Link  function,  257 

binary  data,  312-313 
canonical,  257,  264,  312 
complementary  log-log,  312 
linear,  265 
log-log,  312 
probit,  312 
reciprocal,  258 
Local  likelihood,  592 
Local  polynomial  regression,  580-584 
Locally  weighted  log-likelihood,  592 
Locally  weighted  sum  of  squares,  581 
Logistic  regression,  3,  9,  316-327 
Bayesian  inference,  321-322 
classification, 511
likelihood  inference,  318-321 
parameter interpretation, 316-318
use  in  case-control  study,  338-343 
Loglinear  models,  12 
binary data, 468-470
Lognormal  distribution,  659-660 
as  a  prior,  276-278 

Longitudinal  data,  4,  18,  353,  413-415 
efficiency  of  analysis,  356-358 
Loss  functions 

absolute,  prediction,  507 
deviance,  prediction,  510 
posterior  summarization,  88 
quadratic,  Bayes  estimation,  88 
quadratic, frequentist estimation, 81
quadratic,  prediction,  507 
scaled  quadratic,  prediction,  507 
Stein, frequentist estimation, 81
Low  rank  smoother,  558 

M 

Mahalanobis  distance,  626 
Mallows  CP,  184,  527-529,  533 
Marginal modeling, 18
Markov  chain  Monte  Carlo,  121-134 
batch  means,  126 
blocking,  127 
burn-in,  125 

convergence  diagnostics,  127 
hybrid  schemes,  125 
independence  chain,  123 
Metropolis  within  Gibbs,  125 
multiple  chains,  127 
thinning,  127 
Markov  chains 
aperiodicity,  122 
definition  of,  121 
ergodic  theorem,  122 
global  balance,  121 
homogeneous,  121 
invariant distribution, 121
irreducibility,  122 
reversible,  121 
transition  kernel,  121 
Markov random field priors, 445-447
Masking,  240 
Matrix differentiation, 651
Matrix  results,  653-654 
Maximum  penalized  likelihood  estimator,  519 
Mean  squared  error 

of  Bayes  estimator,  96 
of  frequentist  estimator,  29 
of  positive  FDR,  170 
of  prediction,  513,  514 
random  effect  prediction,  377 
Mean-variance  models 

linear  and  quadratic  variance  functions, 
54-56 

Method of moments, 51
GEE,  397 

GLM  scale  parameter,  262 
negative  binomial  scale  parameter,  55 
quasi-Poisson  scale  parameter,  61 
Metropolis  algorithm,  123 
Metropolis-Hastings  algorithm,  122-123 
generalized linear mixed model, 441
linear  mixed  model,  389 
logistic  regression  model,  325 
loglinear  model,  276 
nonlinear  mixed  model,  483 
nonlinear model, 301
normal  linear  model,  224 
Michaelis-Menten  Model,  283-284 
Misclassification  error,  515 
classification,  635 
Missing  data,  4 
Mixed  models 


binary data, 458-462
generalized linear, 429-449
linear, 358-390
nonlinear, 475-487
Mixture  model,  175 
MLE 

asymptotic  distribution,  39 
binomial model, 40-41
model misspecification, 46-49
Poisson model, 41-43
Weibull  model,  43 
Model  misspecification,  18 

behavior  of  Bayes  estimators,  98-100 
Modes  of  convergence,  673 
Monte  Carlo  test,  72 
Multinomial  distribution,  626 
Multinomial  distribution 
conjugate  prior  for,  149 
Multiple  adaptive  regression  splines,  620-624 
Multiple  linear  regression,  196 
Multiple testing, 164-179, 225
Multivariate binary models, 468-473


N 

Nadaraya-Watson  kernel  estimator,  578-580 
Naive  Bayes,  630 

Negative  binomial  distribution,  54,  81,  258, 
664-665 

quasi-likelihood version, 51
Negative  predictive  value,  516 
Newton-Raphson  method,  263, 432 
Neyman-Pearson  lemma,  154 
Neyman-Scott  problem,  93 
Neyman-Scott problem, 82, 145, 146, 418
NIC,  537 

Nominal  variable,  197,  625 
Nonlinear mixed models, 475-487
Bayesian inference, 481-483
likelihood inference, 478-480
Nonlinear  models,  283-284 

assessment  of  assumptions,  297-298 
Bayesian  estimation,  294 
Bayesian  inference,  293-295 
geometry,  290-293 
hypothesis  testing,  287-288 
identifiability,  284 
intrinsic  curvature,  292 
least  squares  estimation,  288-290 
likelihood  estimation,  285-286 
likelihood  inference,  284-288 
parameter  effects  curvature,  292 
sandwich  estimation,  290 
solution  locus,  291 
Nonparametric regression, 19, 24
Normal  distribution 
multivariate,  657-658 
quasi-likelihood version, 51
Normal  linear  model 

conjugate  prior  for,  223 

O 

Observational  data,  2 
Odds  ratio 

Bayesian  estimation,  146 
conditional, 468-471
interpretation, 317
marginal, 472-473
Ordinal  variable,  197 
Out  of  bag  estimate,  639,  640 
Outer  iteration,  608 
Outliers,  240 
Overdispersion 

Bernoulli  random  variable,  308 
beta-binomial  distribution,  105 
binomial  data,  313-316 
GLMs,  275-276 
Overfitting 

prediction,  529 


P 

Parameter  interpretation,  7,  8 

Bayesian  modeling  averaging,  101 
factors,  202-205 
GEE,  394 

GEE, nonlinear models, 487-488
GLMs,  259-260 
linear  mixed  models,  373-375 
linear  models,  198-209,  232-233 
logistic  regression,  case-control  study, 
339-340 

logistic  regression,  cohort  study,  339 
marginal  versus  conditional  models, 
454-456 

multiple  linear  regression,  201 
nonlinear  mixed  models,  477 
Poisson  conditional  models, 

434-436 

prior  specification,  200 
quadratic  model,  202 
Parameterization 

of  a  nonlinear  model,  477,  478 
Parsimony,  4 

Pearson  statistic,  51,  268,  320 
Penalization,  518 
Penalized  IRLS,  432 


Penalized  IRLS,  589 
Penalized  least  squares,  519 
nonlinear  mixed  models,  479 
spline  models,  553 
Penalized  likelihood,  591 
Penalized  quasi-likelihood,  432,  591 
Performance  iteration,  608 
Pharmacokinetics 

compartmental  models,  14 
general  description,  12-16 
one  compartment  model,  14 
Piecewise  polynomials,  547-552 
Plug-in  estimators,  577,  634 
Poisson  distribution,  663-664 
conjugate  prior  for,  148 
quasi-likelihood version, 51
Poisson  process,  22 
Positive  false  discovery  rate,  169 
Positive  predictive  value,  516 
Power  variance  model,  492 
Prediction 

random  effects,  377-380 
Predictive  distribution 
Bayes,  89 

with  conjugate  prior,  103 
Predictive  models,  2,  10 
Predictive  risk,  513 
Prior  choice,  90-98 
baseline  priors,  90-93 
generalized linear mixed models, 441-442
GLMs,  273 

improper posterior, 91, 92
improper prior, 91, 175
improper prior for linear model, 221
improper prior for nonlinear mixed model, 482

improper  prior  for  Poisson  model,  273-274 
improper  spatial  prior,  446 
lognormal  distribution,  276-278 
nonlinear  models,  294 
objective  Bayes,  90-93 
proper  priors  for  hierarchical  models,  384 
substantive  priors,  93-95 
Probability 

meaning  of,  94 
Projection  matrix,  241 
Pure  significance  test,  72,  154 


Q 

QQ  plot,  243 

mixed  models,  403, 489 
Quadratic  exponential  model,  454 
Quadrature,  107-109 




Quasi-likelihood,  49-56 
binomial  data,  321 
binomial  overdispersion,  315 
hypothesis  testing,  76,  271 
prediction,  53 

Quasi-score  function,  50,  51,  58 


R 

Random  effects,  355 
ANOVA,  230 
interpretation,  356 
Random  forests,  639-642 
variable  importance,  641 
Random  intercepts  and  slopes  model,  361 
Random  intercepts  model,  360 
Randomization,  2,  200,  201 
Randomized  block  design,  227 
Rao-Blackwell-Lehmann-Scheffe  theorem, 
670 

Receiver  operating  characteristic,  517 
Reflected  pair,  622 
Regression  trees,  613-624 
bias-variance trade-off, 619
greedy  algorithm,  617 
hierarchical  partitioning,  614-621 
leaves,  615 
missing  data,  619 
nodes,  615 
overfitting,  618 
pruning,  617 
root,  615 

rooted  subtree,  615 
size,  615 
subtree,  615 

weakest-link  pruning,  618 
Regularization,  504 
Rejection  algorithm,  114 
sampling  from  prior,  116 
Relative  risk 

case-control  study,  339 
Repeated  measures  data,  18,  353 
Residuals 

deviance,  279 

deviance,  binomial  models,  332 
GLMs,  278-280 
linear  models,  240-245 
nonlinear  models,  298 
Pearson,  278 

Pearson, binomial models, 331
Pearson,  mixed  models,  489 
population-level,  mixed  models,  402 


standardized  population-level,  mixed 
models,  402 

standardized  unit-level,  mixed  models,  403 
standardized,  linear  models,  242 
to  determine  form  of  overdispersion,  316 
to  investigate  mean-variance  relationship, 
280-283 

unit-level,  mixed  models,  402 
Restricted maximum likelihood, 368-371
Ridge  regression,  247,  376,  517-522 
Bayesian  formulation,  520 


S 

Sampling  from  the  posterior 

directly using conjugacy, 112-114
directly  using  the  rejection  algorithm, 
114-117 

Sandwich  estimation,  35,  56-63 
GEE, 393, 400-401
linear models, 216-218
model misspecification, 47
Poisson models, 61, 62
relationship  with  the  bootstrap,  66-69 
Saturated  model,  204 
Schwarz  criterion,  140 
Score  function,  37,  38 
binomial,  318 
conditional,  329 
GLM,  260 

negative  binomial,  55 
nonlinear  model,  285 
Score  test,  74-75 
Second-order  stationary,  404 
Semi-variogram,  243,  404 
Sensitivity,  516 
Shrinkage,  517-526 

ridge  regression,  520,  521 
Shrinkage  methods,  504 
Sidak  correction,  167 
Significance  level,  72 
Simple  linear  regression,  196 
Simpson’s  paradox,  335,  349 
Slutsky’s  theorem,  674 
Smoothing  parameter  selection 
multiple  predictors,  608-610 
Soft  thresholding,  524 
Span,  655 

Spatial dependence, 12, 23, 445-449
Specificity,  516 
Splines,  547-572 
B-splines,  555-557 
M-th  order,  549 
O-splines, 559
P-splines,  559 
cubic  smoothing,  553-555 
GLMs,  587-593 

mixed  model  representation,  563-572 
multiple  predictors,  600-606 
natural  cubic,  552-553 
natural thin plate, 601-603
O’Sullivan,  559 

penalized  regression  splines,  557-560 
regression,  557 
smoothing,  555 
summary  of  terminology,  560 
tensor  product,  604-606 
thin  plate  regression,  603-604 
Split-plot  design,  353 
Strong law of large numbers, 674
Student’s  t  distribution,  667 
importance  sampling,  112 
multivariate,  660-661 
Sufficient  statistic,  669 

conditional  likelihood,  438 
conditionally,  670 
conjugacy,  102 
marginal  likelihood,  44 
minimal,  669 
prior,  103 
UMVUE,  30 

Sum-to-zero  constraint,  202 
crossed  design,  227 
one-way  ANOVA,  225 
Supernormality, 243
Superpopulation,  3 
Survey  sampling,  3 

T 

Test  data,  512,  515 
Test  of  significance,  72 
Toeplitz  error  model,  363 
Tolerance  distributions,  317-318 


Training  data,  512,  513,  515 
Transformations  of  the  data,  205-209 
True  positive  fraction,  516 
Truncated  line,  550,  622 
Type  I  error,  154,  155,  516 
multiple  testing,  165 
Type  II  error,  154 

multiple  testing,  165 


U 

Unidentifiability,  16 
Uniformly  minimum  variance  unbiased 
estimator, 30, 81, 670
Unstructured  error  model,  363 


V 

Variable  selection 

all  possible  subsets,  183-185 
backward elimination, 181
Efroymson’s algorithm, 181
forward selection, 181
procedures,  179-185 
stepwise  methods,  181-183 
Variance  estimation 

nonparametric  regression,  584-587 
Varying-coefficient  models,  610-613 
Vector  space,  655 


W 

Wald  test,  75 

Weak law of large numbers, 674
Wishart  distribution,  382,  661-662 
prior,  nonlinear  mixed  models,  484 
Working  variance  model,  58 
GEE,  binary  data,  468 
GEE, GLMs, 451-452
GEE,  linear  models,  391-394 


